I am trying to get the count of unique values in a column using groupby. However, I get different results when I look at the printed groupby output (one row per group) compared to selecting the group directly or using get_group(). What is the problem here?
print "Groupby:",bigDF[bigDF.Class == "apple"].groupby('sizeBin').customerId.nunique()
print "Selection:",bigDF[(bigDF.Class == "apple")&(bigDF.sizeBin == 0)].customerId.nunique()
print "Get group:",bigDF[bigDF.Class == "apple"].groupby('sizeBin').get_group(0).customerId.nunique()
Groupby: sizeBin
0 6
1 14
5 26
10 34
20 32
50 3
100 3
200 7
500 0
Name: customerId, dtype: int64
Selection: 34
Get group: 34
I should also note the data types; .info() gives me the following, so sizeBin is a category:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 224903 entries, 0 to 20616
Data columns (total 3 columns):
customerId 224903 non-null int64
Class 224903 non-null object
sizeBin 224903 non-null category
dtypes: category(1), int64(1), object(1)
memory usage: 5.4+ MB
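As a cross-check (just a sketch, reusing the names above), grouping on the plain values behind the categorical shows which of the two numbers a non-categorical grouping agrees with:
apples = bigDF[bigDF.Class == "apple"]
# Group on the categorical column as-is ...
print(apples.groupby('sizeBin').customerId.nunique())
# ... and on the same values with the category dtype stripped, for comparison.
print(apples.groupby(apples['sizeBin'].astype(object)).customerId.nunique())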
I have a CSV file that has three columns, one called Age_Groups, one called Trip_in_min, and the third called Start_Station_Name (it actually comes from a bigger dataset of 16,845 rows and 17 columns).
Now I need to get the average trip time per age group.
Here is a link to the CSV file on Dropbox, as I did not know how to paste it properly here.
Any help please?
import pandas as pd

file = pd.read_csv(r"file.csv")

# Counting the number of trips per age group
trips_summary = file.Age_Groups.value_counts()
print("Number of trips per age group")
print(trips_summary)
print()

# Finding the 20 most popular stations
popular_stations = file.Start_Station_Name.value_counts()
print("The 20 most popular stations")
print(popular_stations[:20])
print()
UPDATE
OK, it worked; I added the line
df.groupby('Age_Groups', as_index=False)['Trip_in_min'].mean()
Thanks #jjj. However, as I mentioned, my data has more than 16K rows. Once I added the rows back, it started to fail and gave me the message below (it might not be a real error): only the age groups are printed, not the averages. I can only get the averages with 1890 rows or fewer. Here is the message I get for a larger number of rows (other operations work fine with the full dataset, just this one):
D:\Test 1.py:18: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  avg = df.groupby('Age_Groups', as_index=False)['Trip_in_min'].mean()
Age_Groups
0 18-24
1 25-34
2 35-44
3 45-54
4 55-64
5 65-74
6 75+
UPDATE 2
Not all columns are numeric; however, when I use the code below:
df.apply(pd.to_numeric, errors='ignore').info()
I get the output below (my target is column 12, Trip_in_min):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1897 entries, 1 to 1897
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Riverview Park 11 non-null object
1 Riverview Park.1 11 non-null object
2 Riverview Park.2 11 non-null object
3 Start_Station_Name 1897 non-null object
4 3251 98 non-null float64
5 Jersey & 3rd 98 non-null object
6 24443 98 non-null float64
7 Subscriber 98 non-null object
8 1928 98 non-null float64
9 Unnamed: 9 79 non-null float64
10 Age_Groups 1897 non-null object
11 136 98 non-null float64
12 Trip_in_min 1897 non-null object
dtypes: float64(5), object(8)
memory usage: 192.8+ KB
Hope this helps:
import pandas as pd
df= pd.read_csv("test.csv")
df.groupby('Age_Groups', as_index=False)['Trip_in_min'].mean()
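One thing worth noting (a hedged tweak, not a confirmed fix): the .info() output above shows Trip_in_min as dtype object, so coercing it to a number before averaging may help, assuming the same file and column names:
import pandas as pd

df = pd.read_csv("test.csv")
# Trip_in_min is reported as object above, so coerce it to numeric first;
# values that cannot be parsed become NaN and are ignored by mean().
df["Trip_in_min"] = pd.to_numeric(df["Trip_in_min"], errors="coerce")
avg = df.groupby("Age_Groups", as_index=False)["Trip_in_min"].mean()
print(avg)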
All this is asking me to do is write code that shows whether there are any missing values where it is not the customer's first order. I have provided the DataFrame. Should I use the column 'order_number' instead? Is my code wrong?
I named the DataFrame df_orders.
I thought my code would find the rows that have missing values and an order number greater than 1.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478967 entries, 0 to 478966
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 478967 non-null int64
1 user_id 478967 non-null int64
2 order_number 478967 non-null int64
3 order_dow 478967 non-null int64
4 order_hour_of_day 478967 non-null int64
5 days_since_prior_order 450148 non-null float64
dtypes: float64(1), int64(5)
memory usage: 21.9 MB
None
# Are there any missing values where it's not a customer's first order?
m_v_fo= df_orders[df_orders['days_since_prior_order'].isna() > 1]
print(m_v_fo.head())
Empty DataFrame
Columns: [order_id, user_id, order_number, order_dow, order_hour_of_day,
days_since_prior_order]
Index: []
When you call .isna() you get back a Series of True/False values, so comparing it to > 1 will never be true.
Instead, try this:
m_v_fo= df_orders[df_orders['days_since_prior_order'].isna().sum() > 1]
If that doesn't solve the problem, then I'm not sure - try editing your question to add more detail and I can try again. :)
Update: I read your question again, and I think you're doing this out of order. First you need to filter on days_since_prior_order and then look for NaN:
m_v_fo = df_orders[df_orders['days_since_prior_order'] > 1].isna()
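If the goal (as the question suggests) is rows where order_number is greater than 1 but days_since_prior_order is missing, another sketch, assuming the column names from the .info() output above, combines the two conditions directly:
# Rows that are not a customer's first order but have no prior-order gap recorded.
not_first = df_orders['order_number'] > 1
missing_gap = df_orders['days_since_prior_order'].isna()
m_v_fo = df_orders[not_first & missing_gap]
print(len(m_v_fo))
print(m_v_fo.head())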
I have this very simple Python pandas DataFrame called "sa".
>sa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 searchAppearance 7 non-null object
1 clicks 7 non-null int64
2 impressions 7 non-null int64
3 ctr 7 non-null float64
4 position 7 non-null float64
dtypes: float64(2), int64(2), object(1)
memory usage: 408.0+ bytes
with these values
>print(sa)
searchAppearance clicks impressions ctr position
0 AMP_TOP_STORIES 376 376 0.022917 8.108978
1 AMP_BLUE_LINK 55670 55670 0.051522 13.158574
2 PAGE_EXPERIENCE 68446 68446 0.039298 20.056293
3 RECIPE_FEATURE 40175 40175 0.042920 4.186674
4 RECIPE_RICH_SNIPPET 37428 37428 0.069153 18.726152
5 VIDEO 72 72 0.025361 15.896090
6 WEBLITE 1 1 0.001055 51.493671
All is good there.
Now I do
sa['ctr-test']=devices['ctr']
This leads to:
>print(sa)
searchAppearance clicks impressions ctr position ctr-test
0 AMP_TOP_STORIES 376 376 0.022917 8.108978 0.039522
1 AMP_BLUE_LINK 55670 55670 0.051522 13.158574 0.026543
2 PAGE_EXPERIENCE 68446 68446 0.039298 20.056293 0.051098
3 RECIPE_FEATURE 40175 40175 0.042920 4.186674 NaN
4 RECIPE_RICH_SNIPPET 37428 37428 0.069153 18.726152 NaN
5 VIDEO 72 72 0.025361 15.896090 NaN
6 WEBLITE 1 1 0.001055 51.493671 NaN
Do you see all the NaN values, but only from row index 3 onward? It does not make any sense to me.
The DataFrame info still looks good:
sa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 searchAppearance 7 non-null object
1 clicks 7 non-null int64
2 impressions 7 non-null int64
3 ctr 7 non-null float64
4 position 7 non-null float64
5 ctr-test 3 non-null float64
dtypes: float64(3), int64(2), object(1)
memory usage: 464.0+ bytes
I don't get it. What is going wrong? I am using Google Colaboratory.
It seems like a bug, but not in my code? Any idea how to debug this (if it's not in my code)?
The output of sa.info() includes this line:
5 ctr-test 3 non-null float64
It seems that these three non-null values end up in the first three rows of sa.
sa['ctr-test']=devices['ctr']
Stupidity on my side: a mix-up of DataFrames / variables.
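For what it's worth, that NaN pattern is exactly what index alignment produces when a column is assigned from a different DataFrame; a minimal sketch with made-up numbers:
import pandas as pd

sa = pd.DataFrame({'ctr': [0.1, 0.2, 0.3, 0.4]}, index=[0, 1, 2, 3])
devices = pd.DataFrame({'ctr': [0.9, 0.8, 0.7]}, index=[0, 1, 2])  # shorter index

# Assignment aligns on the index: labels 0-2 get values, label 3 becomes NaN.
sa['ctr-test'] = devices['ctr']
print(sa)

# If the column should simply be copied from the same frame:
sa['ctr-test'] = sa['ctr']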
I am trying to create a new pandas dataframe displayDF with 4 columns from the dataframe finalDF.
displayDF = finalDF[['False','True','RULE ID','RULE NAME']]
This command is failing with the error:
KeyError: "['False', 'True'] not in index"
However, I can see the columns "False" and "True" when I run finalDF.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 11
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 rule_rec_id 12 non-null object
1 False 12 non-null int64
2 True 12 non-null int64
3 RULE ID 12 non-null object
4 RULE NAME 12 non-null object
5 RULE DESCRIPTION 12 non-null object
dtypes: int64(2), object(4)
memory usage: 672.0+ bytes
Additional Background:
I created finalDF by merging two dataframes (pivot_stackedPandasDF and dfPandaDescriptions)
finalDF = pd.merge(pivot_stackedPandasDF, dfPandaDescriptions, how='left', left_on=['rule_rec_id'], right_on=['RULE ID'])
I created pivot_stackedPandasDF with this command.
pivot_stackedPandasDF = stackedPandasDF.pivot_table(index="rule_rec_id", columns="alert_value", values="count").reset_index()
I think the root cause may be in the way I ran the .pivot_table() command.
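One hedged thing to check, assuming alert_value in the pivot was boolean: the pivoted columns would then be labelled with the booleans False/True rather than the strings 'False'/'True', even though .info() prints them the same way. A sketch:
print(finalDF.columns.tolist())  # e.g. ['rule_rec_id', False, True, 'RULE ID', ...]

# If the labels really are booleans, select them as booleans ...
displayDF = finalDF[[False, True, 'RULE ID', 'RULE NAME']]

# ... or rename them to strings first and keep the original selection.
finalDF = finalDF.rename(columns={False: 'False', True: 'True'})
displayDF = finalDF[['False', 'True', 'RULE ID', 'RULE NAME']]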
I have a DataFrame called result.
In the tutorial I'm watching, Wes McKinney gets the following return data when he executes a cell with just the name of the df in it; when I execute a cell with result in it, I get the whole frame returned.
Is there a pandas set_option I can use to switch between the two kinds of output?
There is a display.large_repr option for that:
In [95]: pd.set_option('large_repr', 'info')
In [96]: df
Out[96]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
a 1000000 non-null int32
b 1000000 non-null int32
c 1000000 non-null int32
dtypes: int32(3)
memory usage: 11.4 MB
From the docs:
display.large_repr : ‘truncate’/’info’
For DataFrames exceeding
max_rows/max_cols, the repr (and HTML repr) can show a truncated table
(the default from 0.13), or switch to the view from df.info() (the
behaviour in earlier versions of pandas).
[default: truncate]
[currently: truncate]
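If you only want the info-style repr occasionally, a temporary option context (a standard pandas pattern) avoids changing the global setting; a small sketch:
import pandas as pd

# Switch to the info-style repr for this block only
# (per the docs above, it only kicks in for frames exceeding max_rows/max_cols).
with pd.option_context('display.large_repr', 'info'):
    print(result)

# Or undo a global pd.set_option later:
pd.reset_option('display.large_repr')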
PS: You may also want to read about frequently used pandas options.
But IMO it would be much more convenient and explicit to use the .info() method:
result.info()
demo:
In [92]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
a 1000000 non-null int32
b 1000000 non-null int32
c 1000000 non-null int32
dtypes: int32(3)
memory usage: 11.4 MB
In [93]: df.head()
Out[93]:
a b c
0 1 0 1
1 6 1 9
2 5 2 3
3 6 4 3
4 8 9 2
The df.info(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None) method will give you what you want. By default, information about the number of values and the size of the DataFrame is displayed. The documentation is here. The verbose setting may be particularly useful for larger datasets, as it shows the full output, including the number of non-null values.
Default:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 10 entries, 0 to 9
dtypes: float64(10)
memory usage: 7.9 KB
With verbose = True:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 10 columns):
0 100 non-null float64
1 100 non-null float64
2 100 non-null float64
3 100 non-null float64
4 100 non-null float64
5 100 non-null float64
6 100 non-null float64
7 100 non-null float64
8 100 non-null float64
9 100 non-null float64
dtypes: float64(10)
memory usage: 7.9 KB
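For reference, output in roughly these two shapes can be produced from a throwaway frame, e.g. (a sketch, not the exact frame used above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 10))

df.info(verbose=False)  # compact summary: row count, column range, dtypes, memory
df.info(verbose=True)   # full per-column listing with non-null counts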