I am trying to create a new pandas dataframe displayDF with 4 columns from the dataframe finalDF.
displayDF = finalDF[['False','True','RULE ID','RULE NAME']]
This command is failing with the error:
KeyError: "['False', 'True'] not in index"
However, I can see the columns "False" and "True" when I run finalDF.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 11
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 rule_rec_id 12 non-null object
1 False 12 non-null int64
2 True 12 non-null int64
3 RULE ID 12 non-null object
4 RULE NAME 12 non-null object
5 RULE DESCRIPTION 12 non-null object
dtypes: int64(2), object(4)
memory usage: 672.0+ bytes
Additional Background:
I created finalDF by merging two dataframes (pivot_stackedPandasDF and dfPandaDescriptions)
finalDF = pd.merge(pivot_stackedPandasDF, dfPandaDescriptions, how='left', left_on=['rule_rec_id'], right_on=['RULE ID'])
I created pivot_stackedPandasDF with this command.
pivot_stackedPandasDF = stackedPandasDF.pivot_table(index="rule_rec_id", columns="alert_value", values="count").reset_index()
I think the root cause may be in the way I ran the .pivot_table() command.
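You can check what the labels actually are with finalDF.columns.tolist(). If alert_value holds booleans rather than the strings 'False' and 'True', the pivoted column labels are the boolean values themselves, which would explain the KeyError. A minimal sketch of a fix, assuming that is the case:
# Rename the pivoted boolean columns to strings, then select as before.
finalDF = finalDF.rename(columns={False: 'False', True: 'True'})
displayDF = finalDF[['False', 'True', 'RULE ID', 'RULE NAME']]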
All this is asking me to do is write code that shows whether there are any missing values where it is not the customer's first order. I have provided the DataFrame. Should I use the 'order_number' column instead? Is my code wrong?
I named the DataFrame df_orders.
I thought my code would find the rows that have missing values and an order number greater than 1.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478967 entries, 0 to 478966
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 478967 non-null int64
1 user_id 478967 non-null int64
2 order_number 478967 non-null int64
3 order_dow 478967 non-null int64
4 order_hour_of_day 478967 non-null int64
5 days_since_prior_order 450148 non-null float64
dtypes: float64(1), int64(5)
memory usage: 21.9 MB
None
# Are there any missing values where it's not a customer's first order?
m_v_fo= df_orders[df_orders['days_since_prior_order'].isna() > 1]
print(m_v_fo.head())
Empty DataFrame
Columns: [order_id, user_id, order_number, order_dow, order_hour_of_day,
days_since_prior_order]
Index: []
When you call .isna() you get back a Series of True/False values, so the result will never be greater than 1.
Instead, try this:
m_v_fo= df_orders[df_orders['days_since_prior_order'].isna().sum() > 1]
If that doesn't solve the problem, then I'm not sure - try editing your question to add more detail and I can try again. :)
Update: I read your question again, and I think you're doing this out of order. First you need to filter on days_since_prior_order and then look for na.
m_v_fo = df_orders[df_orders['days_since_prior_order'] > 1].isna()
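If the intent is to list the rows where days_since_prior_order is missing even though the order is not the customer's first, a sketch (assuming "not the first order" means order_number > 1) combines the two conditions explicitly:
# Rows where days_since_prior_order is missing AND it is not the first order.
m_v_fo = df_orders[
    df_orders['days_since_prior_order'].isna() & (df_orders['order_number'] > 1)
]
print(m_v_fo.head())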
I have a dataframe that has 3 columns and looks like this:
name date result
Anya 2021-02-13 0
Frank 2021-02-14 1
The other dataframe looks like this:
name date
Anya 2021-02-13
Frank 2021-02-14
I need to match the data types of one dataframe to the other. Because df_1 has one additional column, I got an error. My code looks like this:
df_1.info()
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 717 non-null object
1 date 717 non-null object
2 result 717 non-null int64
df_2.info()
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 717 non-null object
1 date 717 non-null datetime64[ns]
# Match the primary df to secondary df
for x in df_1.columns:
    df_2[x] = df_2[x].astype(df_1[x].dtypes.name)
I got this error: KeyError: 'profitable'. What would be a workaround here? I need the dtypes of df_2 to be exactly the same as in df_1. Thanks!
df1 -> the dataframe that has 3 columns
df2 -> the other dataframe
First, use a boolean mask to find the columns that are common to both dataframes:
mask=df1.columns.isin(df2.columns)
df=df1[df1.columns[mask]]
Finally, use the astype() method:
df2=df2.astype(df.dtypes)
Or you can do all of this in one line:
df2=df2.astype(df1[df1.columns[df1.columns.isin(df2.columns)]].dtypes)
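A quick, self-contained demo of that one-liner on toy data shaped like the question (the values here are made up for illustration):
import pandas as pd

# df1 stores date as plain strings (object) and has the extra 'result' column;
# df2 stores date as datetime64[ns], mirroring the question's .info() output.
df1 = pd.DataFrame({'name': ['Anya', 'Frank'],
                    'date': ['2021-02-13', '2021-02-14'],
                    'result': [0, 1]})
df2 = pd.DataFrame({'name': ['Anya', 'Frank'],
                    'date': pd.to_datetime(['2021-02-13', '2021-02-14'])})

# Cast only the columns df2 actually has, using df1's dtypes for them.
df2 = df2.astype(df1[df1.columns[df1.columns.isin(df2.columns)]].dtypes)
print(df2.dtypes)  # name and date are now object, matching df1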
So, I've been working with pandas in Python and I have data extracted from an external system with lots of spaces at the end of each column. My idea was to use the str.strip() method on each Series, with this code:
Data["DESCRIPTION"] = Data["DESCRIPTION"].str.strip()
It basically did its job, but I noticed an issue when I check the properties of the data frame using .info(): if a value contained only spaces without any text, it is now an empty string, but it is not treated as null:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18028 entries, 0 to 18027
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 VIN 18028 non-null object
1 DESCRIPTION 18028 non-null object
2 DESCRIPTION 2 18028 non-null object
3 ENGINE 18023 non-null object
4 TRANSMISSION 18028 non-null object
5 PAINT 18028 non-null object
6 EXT_COLOR_CODE 18028 non-null object
7 EXT_COLOR_DESC 18028 non-null object
8 INT_COLOR_DESC 18028 non-null object
9 COUNTRY 18028 non-null object
10 PROD_DATE 18028 non-null object
dtypes: object(11)
memory usage: 1.5+ MB
However, checking whether the string is empty:
Data['DESCRIPTION 2'] == ""
0 True
1 True
2 True
3 True
4 True
...
18023 True
18024 True
18025 True
18026 True
18027 True
Name: DESCRIPTION 2, Length: 18028, dtype: bool
How could I convert all of those to null so I could drop them using the dropna() function?
I'd be grateful for any suggestions.
To remove trailing spaces and replace empty strings or whitespace-only records with NaN, run the command below.
import numpy as np
Data["DESCRIPTION"] = Data["DESCRIPTION"].str.strip().replace(r'^\s*$', np.nan, regex=True)
Please refer to this page: Replacing blank values (white space) with NaN in pandas
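If the goal is to clean every text column at once and then drop the rows whose DESCRIPTION 2 became empty, a sketch along those lines (treating all object columns the same way is an assumption, not something stated in the post):
import numpy as np

# Strip surrounding whitespace in every object (string) column, turn
# whitespace-only/empty strings into NaN, then drop rows that are null
# in DESCRIPTION 2.
text_cols = Data.select_dtypes(include='object').columns
Data[text_cols] = Data[text_cols].apply(
    lambda s: s.str.strip().replace(r'^\s*$', np.nan, regex=True)
)
Data = Data.dropna(subset=['DESCRIPTION 2'])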
I'm having trouble merging two dataframes in pandas. They are parts of a dataset split between two files, and they share some columns and values, namely 'name' and 'address'. The entries with identical values do not share their index with entries in the other file. I tried variations of the following line:
res = pd.merge(df, df_p, on=['name', 'address'], how="left")
When the how argument was set to 'left', the columns from df_p had no values. 'right' had the opposite effect, with columns from df being empty. 'inner' resulted in an empty dataframe and 'outer' duplicated the number of entries, essentially just appending the results of 'left' and 'right'.
I manually verified that there are identical combinations of 'name' and 'address' values in both files.
Edit: Merging on just one of those columns appears to be successful; however, I want to avoid merging incorrect entries in case two people with identical names have different addresses, and vice versa.
Edit 1: Here's some more information on the dataset.
df.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3983 entries, 0 to 3982
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 3983 non-null int64
1 name 3983 non-null object
2 address 3983 non-null object
3 race 3970 non-null object
4 marital-status 3967 non-null object
5 occupation 3971 non-null object
6 pregnant 3969 non-null object
7 education-num 3965 non-null float64
8 relationship 3968 non-null object
9 skewness_glucose 3972 non-null float64
10 mean_glucose 3572 non-null float64
11 capital-gain 3972 non-null float64
12 kurtosis_glucose 3970 non-null float64
13 education 3968 non-null object
14 fnlwgt 3968 non-null float64
15 class 3969 non-null float64
16 std_glucose 3965 non-null float64
17 income 3974 non-null object
18 medical_info 3968 non-null object
19 native-country 3711 non-null object
20 hours-per-week 3971 non-null float64
21 capital-loss 3969 non-null float64
22 workclass 3968 non-null object
dtypes: float64(10), int64(1), object(12)
memory usage: 715.8+ KB
example entry from df:
0,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201", White, Married-civ-spouse, Exec-managerial,f,9.0, Husband,1.904881822,79.484375,15024.0,0.667177618, HS-grad,147707.0,0.0,39.49544760000001, >50K,"{'mean_oxygen':'1.501672241','std_oxygen':'13.33605383','kurtosis_oxygen':'11.36579476','skewness_oxygen':'156.77910559999995'}", United-States,60.0,0.0, Private
df_p.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3933 entries, 0 to 3932
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 3933 non-null int64
1 name 3933 non-null object
2 address 3933 non-null object
3 age 3933 non-null int64
4 sex 3933 non-null object
5 date_of_birth 3933 non-null object
dtypes: int64(2), object(4)
memory usage: 184.5+ KB
sample entry from df_p:
2273,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201",44, Male,1975-03-26
As you can see, the chosen samples are for the same person, but their index does not match, which is why I tried using the name and address columns.
Edit 2: Changing the order of df and df_p in the merge seems to have solved the issue, though I have no clue why.
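One hypothesis worth checking (not confirmed in the post): the sample rows show embedded newlines in the address and stray leading spaces in several fields, so the join keys may differ only by whitespace between the two files. A sketch that normalizes the keys before merging:
# Collapse all runs of whitespace (including newlines) and trim the ends
# of the join keys in both frames, then merge again.
for frame in (df, df_p):
    for col in ['name', 'address']:
        frame[col] = frame[col].str.replace(r'\s+', ' ', regex=True).str.strip()

res = pd.merge(df, df_p, on=['name', 'address'], how='left')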
I am trying to get the count of unique values in a column using groupby. However, the counts printed for each group by groupby are not the same as what I get with a direct selection or with get_group(). What is the problem here?
print "Groupby:",bigDF[bigDF.Class == "apple"].groupby('sizeBin').customerId.nunique()
print "Selection:",bigDF[(bigDF.Class == "apple")&(bigDF.sizeBin == 0)].customerId.nunique()
print "Get group:",bigDF[bigDF.Class == "apple"].groupby('sizeBin').get_group(0).customerId.nunique()
Groupby: sizeBin
0 6
1 14
5 26
10 34
20 32
50 3
100 3
200 7
500 0
Name: customerId, dtype: int64
Selection: 34
Get group: 34
I should also note the data types; .info() gives me the following, so sizeBin is a category:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 224903 entries, 0 to 20616
Data columns (total 3 columns):
customerId 224903 non-null int64
Class 224903 non-null object
sizeBin 224903 non-null category
dtypes: category(1), int64(1), object(1)
memory usage: 5.4+ MB
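Two things stand out in that output, though neither is confirmed as the cause: the Int64Index has 224903 entries but labels only run from 0 to 20616 (so the index contains duplicates), and the group key is categorical. Both have been sources of surprising groupby behaviour in older pandas versions, so a quick diagnostic sketch is to repeat the computation on a reset index and with a plain integer key and compare:
# Diagnostic sketch: rule out duplicate index labels and the categorical key.
apples = bigDF[bigDF.Class == "apple"].reset_index(drop=True)

print(apples.groupby('sizeBin').customerId.nunique())
print(apples.groupby(apples['sizeBin'].astype(int)).customerId.nunique())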