Is there something like head(), summary() of R in python? [duplicate]

I want to preview a pandas DataFrame. I would use head(mymatrix) in R, but I do not know how to do this with pandas in Python.
When I type
df.head(10) I get...
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 14 columns):
#Book_Date 10 non-null values
Item_Qty 10 non-null values
Item_id 10 non-null values
Location_id 10 non-null values
MFG_Discount 10 non-null values
Sale_Revenue 10 non-null values
Sales_Flg 10 non-null values
Sell_Unit_Cost 5 non-null values
Store_Discount 10 non-null values
Transaction_Id 10 non-null values
Unit_Cost_Amt 10 non-null values
Unit_Received_Cost 5 non-null values
Unnamed: 0 10 non-null values
Weight 10 non-null values

Suppose you want to output the first and last 10 rows of the iris data set.
In R:
data(iris)
head(iris, 10)
tail(iris, 10)
In Python (scikit-learn required to load the iris data set):
import pandas as pd
from sklearn import datasets
iris = pd.DataFrame(datasets.load_iris().data)
iris.head(10)
iris.tail(10)
Now, as previously answered, if your data frame is too large to fit in your terminal display, a summary view is printed instead of the rows. To see the data itself in a terminal, you can either widen the terminal or reduce the number of columns to display, as follows.
iris.iloc[:,1:2].head(10)
EDIT. Changed .ix to .iloc. From the pandas documentation,
Starting in 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
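For the summary() part of the question, pandas offers df.describe(), which prints count, mean, std, min, quartile, and max statistics per numeric column, and the display options can be widened so that head() prints actual rows instead of the truncated summary view. A minimal sketch, reusing the iris frame from above:
import pandas as pd
from sklearn import datasets

iris = pd.DataFrame(datasets.load_iris().data)

# Rough analogue of R's summary(): per-column count, mean, std, min, quartiles, max
print(iris.describe())

# Widen the console representation so wide frames print as rows
# rather than collapsing into a summary
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)
print(iris.head(10))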

Related

average value of column with different keys, pandas

I have a CSV file with three columns, one called Age_Groups, one called Trip_in_min, and a third called Start_Station_Name (it actually comes from a bigger dataset of 16845 rows and 17 columns).
Now I need to get the average trip time per age group.
Here is the link to the CSV file on Dropbox, as I did not know how to paste it properly here.
Any help, please?
import pandas as pd

file = pd.read_csv(r"file.csv")

# Counting the number of trips per age group (value_counts counts rows, not minutes)
trips_summary = file.Age_Groups.value_counts()
print("Number of trips per age group")
print(trips_summary)
print()

# Finding the most popular 20 stations
popular_stations = file.Start_Station_Name.value_counts()
print("The most popular 20 stations")
print(popular_stations[:20])
print()
UPDATE
Ok, it worked, I added the line
df.groupby('Age_Groups', as_index=False)['Trip_in_min'].mean()
Thanks #jjj. However, as I mentioned, my data has more than 16K rows; once I added the rows back, it started to fail and gave me the message below (which might not be a real error), with only the age groups printed and no averages. It only works if I have 1890 rows or less. Here is the message I get for a larger number of rows (by the way, other operations work fine with the full dataset, just not this one):
D:\Test 1.py:18: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  avg = df.groupby('Age_Groups', as_index=False)['Trip_in_min'].mean()
Age_Groups
0 18-24
1 25-34
2 35-44
3 45-54
4 55-64
5 65-74
6 75+
UPDATE 2
Not all columns are numbers; however, when I use the code below:
df.apply(pd.to_numeric, errors='ignore').info()
I get the output below (my target is column number 12):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1897 entries, 1 to 1897
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Riverview Park 11 non-null object
1 Riverview Park.1 11 non-null object
2 Riverview Park.2 11 non-null object
3 Start_Station_Name 1897 non-null object
4 3251 98 non-null float64
5 Jersey & 3rd 98 non-null object
6 24443 98 non-null float64
7 Subscriber 98 non-null object
8 1928 98 non-null float64
9 Unnamed: 9 79 non-null float64
10 Age_Groups 1897 non-null object
11 136 98 non-null float64
12 Trip_in_min 1897 non-null object
dtypes: float64(5), object(8)
memory usage: 192.8+ KB
Hope this helps:
import pandas as pd
df= pd.read_csv("test.csv")
df.groupby('Age_Groups', as_index=False)['Trip_in_min'].mean()
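The info() output above shows Trip_in_min stored as object rather than a number, which is why the mean skips it and newer pandas emits the numeric_only warning. A hedged sketch, assuming the column holds numeric strings, that coerces it before grouping:
import pandas as pd

df = pd.read_csv("test.csv")
# Coerce the trip duration to a number; anything unparseable becomes NaN instead of raising
df['Trip_in_min'] = pd.to_numeric(df['Trip_in_min'], errors='coerce')
avg = df.groupby('Age_Groups', as_index=False)['Trip_in_min'].mean()
print(avg)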

Why am I getting an empty index?

All this is asking me to do is write code that shows whether there are any missing values where it is not the customer's first order. I have provided the DataFrame info below. Should I use the 'order_number' column instead? Is my code wrong?
I named the DataFrame df_orders.
I thought my code would find the rows that have missing values and an order number greater than 1.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478967 entries, 0 to 478966
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 478967 non-null int64
1 user_id 478967 non-null int64
2 order_number 478967 non-null int64
3 order_dow 478967 non-null int64
4 order_hour_of_day 478967 non-null int64
5 days_since_prior_order 450148 non-null float64
dtypes: float64(1), int64(5)
memory usage: 21.9 MB
None
# Are there any missing values where it's not a customer's first order?
m_v_fo= df_orders[df_orders['days_since_prior_order'].isna() > 1]
print(m_v_fo.head())
Empty DataFrame
Columns: [order_id, user_id, order_number, order_dow, order_hour_of_day,
days_since_prior_order]
Index: []
When you call .isna() you get back a Series of True/False values, so it will never be > 1.
Instead, try this:
m_v_fo= df_orders[df_orders['days_since_prior_order'].isna().sum() > 1]
If that doesn't solve the problem, then I'm not sure - try editing your question to add more detail and I can try again. :)
Update: I read your question again, and I think you're doing this out of order. First you need to filter on days_since_prior_order and then look for na.
m_v_fo = df_orders[df_orders['days_since_prior_order'] > 1].isna()
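If the goal is literally "missing values where it is not the customer's first order", a sketch that uses the order_number column the asker mentions (my reading of the intent, not part of the original answer):
# Rows that are not a customer's first order but still lack days_since_prior_order
m_v_fo = df_orders[(df_orders['order_number'] > 1) &
                   (df_orders['days_since_prior_order'].isna())]
print(m_v_fo.head())
print(len(m_v_fo))  # number of such rows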

Add new columns to existing dataframe with loops and conditions

I have two dataframes. One comes from an Excel file and the other will be created from user inputs. Based on the user inputs and on conditions on columns in the 1st dataframe, new columns should be added to the 1st dataframe with calculations. I have written the code, which worked for the test data, but the results are not making it into the dataframe. Any help?
1st Dataframe:
Data columns (total 9 columns):
Column Non-Null Count Dtype
0 DDO Code 8621 non-null object
1 ULB Name 8621 non-null object
2 Dist. 8621 non-null object
3 Div. 8621 non-null object
4 Kgid No 8621 non-null int64
5 Name Of The Official 8621 non-null object
6 PRAN Number 8621 non-null float64
7 Join Date 8621 non-null datetime64[ns]
8 Present Basic 8621 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(2), object(5)
2nd Dataframe will be created by user inputs:
[screenshot of the second dataframe, built from the user inputs]
From the above data, I need to append 'n' columns to the 1st dataframe based on the user inputs, with loops and conditions.
Here is the code:
for a, b in zip(month_data.month_list, month_data.month_range):
    for i, x in zip(contr_calc_new["Join Date"], contr_calc_new['Present Basic']):
        if i.date().strftime('%Y-%m') == b.date().strftime('%Y-%m'):
            contr_calc_new[a] = 0
        else:
            contr_calc_new[a] = int(((x + (x*rate)//100)*14//100))
This code works for the test data, but the results calculated from the 2nd dataframe are not being appended to the 1st dataframe.
I need the result to behave like this:
if the Join Date column is equal to the year and month entered by the user, it must return zero; otherwise it should return the calculation. Thanks in advance for the help.
Finally I found the proper code. Thank you for your replies.
import numpy as np

for a, b in zip(month_data.month_list, month_data.month_range):
    contr_calc_new[a] = np.where(
        contr_calc_new['Join Date'].dt.strftime('%Y-%m') == b.date().strftime('%Y-%m'),
        0,
        ((contr_calc_new['Present Basic'] + (contr_calc_new['Present Basic']*da_rate)//100)*14//100).astype(int))
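A likely reason the loop version appeared to do nothing (an observation, not from the thread): contr_calc_new[a] = 0 and the else branch each assign a single scalar to the whole column, so every row iteration overwrites the previous one and only the last row's result survives, whereas np.where fills the column in one vectorized step. A small self-contained sketch of the same pattern on made-up data:
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the user's dataframe and inputs
df = pd.DataFrame({
    'Join Date': pd.to_datetime(['2021-04-15', '2021-05-01', '2021-06-20']),
    'Present Basic': [30000, 42000, 51000],
})
da_rate = 17                         # assumed rate entered by the user
month = pd.Timestamp('2021-05-01')   # assumed month entered by the user

# 0 for rows joining in that month, the contribution formula otherwise
df['2021-05'] = np.where(
    df['Join Date'].dt.strftime('%Y-%m') == month.strftime('%Y-%m'),
    0,
    ((df['Present Basic'] + (df['Present Basic'] * da_rate) // 100) * 14 // 100).astype(int),
)
print(df)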

Why Are Some Columns "Not In Index" When Creating a New Dataframe?

I am trying to create a new pandas dataframe displayDF with 4 columns from the dataframe finalDF.
displayDF = finalDF[['False','True','RULE ID','RULE NAME']]
This command is failing with the error:
KeyError: "['False', 'True'] not in index"
However, I can see the columns "False" and "True" when I run finalDF.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 11
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 rule_rec_id 12 non-null object
1 False 12 non-null int64
2 True 12 non-null int64
3 RULE ID 12 non-null object
4 RULE NAME 12 non-null object
5 RULE DESCRIPTION 12 non-null object
dtypes: int64(2), object(4)
memory usage: 672.0+ bytes
Additional Background:
I created finalDF by merging two dataframes (pivot_stackedPandasDF and dfPandaDescriptions)
finalDF = pd.merge(pivot_stackedPandasDF, dfPandaDescriptions, how='left', left_on=['rule_rec_id'], right_on=['RULE ID'])
I created pivot_stackedPandasDF with this command.
pivot_stackedPandasDF = stackedPandasDF.pivot_table(index="rule_rec_id", columns="alert_value", values="count").reset_index()
I think the root cause may be in the way I ran the .pivot_table() command.
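One thing worth checking (a guess based on the pivot, not something confirmed in the question): if alert_value holds booleans, the pivoted column labels are the Python booleans True and False, which info() prints exactly like the strings 'True' and 'False'. In that case you would either select with boolean labels or normalise the labels to strings first, for example:
# If the labels really are booleans, select them as booleans ...
displayDF = finalDF[[False, True, 'RULE ID', 'RULE NAME']]

# ... or convert every column label to a string and keep the original selection
finalDF.columns = finalDF.columns.map(str)
displayDF = finalDF[['False', 'True', 'RULE ID', 'RULE NAME']]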

groupby and get_group does not give the same result

I am trying to get the count of unique values in a column using groupby. However, the numbers printed in the groupby result (one per group) do not match what I get from a plain boolean selection or from get_group(). What is the problem here?
print "Groupby:",bigDF[bigDF.Class == "apple"].groupby('sizeBin').customerId.nunique()
print "Selection:",bigDF[(bigDF.Class == "apple")&(bigDF.sizeBin == 0)].customerId.nunique()
print "Get group:",bigDF[bigDF.Class == "apple"].groupby('sizeBin').get_group(0).customerId.nunique()
Groupby: sizeBin
0 6
1 14
5 26
10 34
20 32
50 3
100 3
200 7
500 0
Name: customerId, dtype: int64
Selection: 34
Get group: 34
I should also note the data types; df.info() gives me the following, so sizeBin is a category:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 224903 entries, 0 to 20616
Data columns (total 3 columns):
customerId 224903 non-null int64
Class 224903 non-null object
sizeBin 224903 non-null category
dtypes: category(1), int64(1), object(1)
memory usage: 5.4+ MB
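The excerpt ends before an answer, but a small self-contained frame with a categorical key makes it easy to compare the three counting approaches side by side (a sketch on made-up data, not the asker's):
import pandas as pd

df = pd.DataFrame({
    'customerId': [1, 2, 2, 3, 4, 4, 5],
    'Class': ['apple'] * 7,
    'sizeBin': pd.Categorical([0, 0, 0, 1, 1, 5, 5], categories=[0, 1, 5, 10]),
})

apples = df[df['Class'] == 'apple']
print(apples.groupby('sizeBin', observed=False)['customerId'].nunique())
print(apples[apples['sizeBin'] == 0]['customerId'].nunique())
print(apples.groupby('sizeBin', observed=False).get_group(0)['customerId'].nunique())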
