I have this very simple Python Pandas DataFrame called "sa":
>sa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 searchAppearance 7 non-null object
1 clicks 7 non-null int64
2 impressions 7 non-null int64
3 ctr 7 non-null float64
4 position 7 non-null float64
dtypes: float64(2), int64(2), object(1)
memory usage: 408.0+ bytes
with these values
>print(sa)
searchAppearance clicks impressions ctr position
0 AMP_TOP_STORIES 376 376 0.022917 8.108978
1 AMP_BLUE_LINK 55670 55670 0.051522 13.158574
2 PAGE_EXPERIENCE 68446 68446 0.039298 20.056293
3 RECIPE_FEATURE 40175 40175 0.042920 4.186674
4 RECIPE_RICH_SNIPPET 37428 37428 0.069153 18.726152
5 VIDEO 72 72 0.025361 15.896090
6 WEBLITE 1 1 0.001055 51.493671
All is good there.
Now I do
sa['ctr-test']=devices['ctr']
this leads to
>print(sa)
searchAppearance clicks impressions ctr position ctr-test
0 AMP_TOP_STORIES 376 376 0.022917 8.108978 0.039522
1 AMP_BLUE_LINK 55670 55670 0.051522 13.158574 0.026543
2 PAGE_EXPERIENCE 68446 68446 0.039298 20.056293 0.051098
3 RECIPE_FEATURE 40175 40175 0.042920 4.186674 NaN
4 RECIPE_RICH_SNIPPET 37428 37428 0.069153 18.726152 NaN
5 VIDEO 72 72 0.025361 15.896090 NaN
6 WEBLITE 1 1 0.001055 51.493671 NaN
Do you see all the NaN values? They only start at index 3 (the fourth row). It does not make any sense to me.
the dataframe info still looks good
sa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 searchAppearance 7 non-null object
1 clicks 7 non-null int64
2 impressions 7 non-null int64
3 ctr 7 non-null float64
4 position 7 non-null float64
5 ctr-test 3 non-null float64
dtypes: float64(3), int64(2), object(1)
memory usage: 464.0+ bytes
I don't get it. What is going wrong? I am using Google Colab.
Seems like a bug, but not in my code? Any idea on how to debug this? (If it's not in my code.)
The output of sa.info() includes this line:
5 ctr-test 3 non-null float64
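It seems that these three non-null values end up in the first three rows of sa. Column assignment aligns on the index, so if devices has only three rows (index 0 to 2), the remaining rows of sa get NaN:
sa['ctr-test']=devices['ctr']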
Stupidity on my side: I mixed up the dataframes / variables.
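The alignment behaviour described above can be reproduced with a minimal sketch. The two frames below are hypothetical stand-ins (the real devices frame is not shown in the question); the assumption is simply that devices has fewer rows than sa.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames in the question.
sa = pd.DataFrame({"ctr": [0.02, 0.05, 0.04, 0.04, 0.07, 0.03, 0.001]})  # index 0..6
devices = pd.DataFrame({"ctr": [0.039522, 0.026543, 0.051098]})         # index 0..2

# Assigning a Series aligns on index labels, not on position:
# labels 3..6 do not exist in devices, so those rows become NaN.
sa["ctr-test"] = devices["ctr"]

non_null = int(sa["ctr-test"].notna().sum())
print(sa)
print("non-null values in ctr-test:", non_null)
```

Comparing len(sa) with len(devices), or printing devices.index, is usually enough to spot this kind of mix-up.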
I have a csv file that has three columns: Age_Groups, Trip_in_min and Start_Station_Name. (It actually comes from a bigger dataset with 16845 rows and 17 columns.)
Now I need to get the average trip time per age group.
Here is the link to the csv file, in dropbox, as I did not know how to paste it properly here.
Any help, please?
import pandas as pd
file = pd.read_csv(r"file.csv")
# Count the number of trips per age group
trips_summary = file.Age_Groups.value_counts()
print("Number of trips per age group")
print(trips_summary)
print()
# Finding the most popular 20 stations
popular_stations = (file.Start_Station_Name.value_counts())
print("The most popular 20 stations")
print(popular_stations[:20])
print()
UPDATE
Ok, it worked, I added the line
df.groupby('Age_Groups', as_index=False)['Trip_in_min'].mean()
Thanks @jjj. However, as I mentioned, my data has more than 16K rows. Once I added the rows back, it started to fail with the message below (which might not be a real error): only the age groups are printed, not the averages. I get the averages only if I have 1890 rows or fewer. Other operations work fine with the full dataset; it is just this one. Here is the message I get for a larger number of rows:
D:\Test 1.py:18: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
avg = df.groupby('Age_Groups', as_index=False)['Trip_in_min'].mean()
Age_Groups
0 18-24
1 25-34
2 35-44
3 45-54
4 55-64
5 65-74
6 75+
UPDATE 2
Not all columns are numbers, however when I use the code below:
df.apply(pd.to_numeric, errors='ignore').info()
I get the output below (my target is column number 12):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1897 entries, 1 to 1897
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Riverview Park 11 non-null object
1 Riverview Park.1 11 non-null object
2 Riverview Park.2 11 non-null object
3 Start_Station_Name 1897 non-null object
4 3251 98 non-null float64
5 Jersey & 3rd 98 non-null object
6 24443 98 non-null float64
7 Subscriber 98 non-null object
8 1928 98 non-null float64
9 Unnamed: 9 79 non-null float64
10 Age_Groups 1897 non-null object
11 136 98 non-null float64
12 Trip_in_min 1897 non-null object
dtypes: float64(5), object(8)
memory usage: 192.8+ KB
Hope this helps:
import pandas as pd
df= pd.read_csv("test.csv")
df.groupby('Age_Groups', as_index=False)['Trip_in_min'].mean()
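The info() output in UPDATE 2 lists Trip_in_min with dtype object, which is one plausible reason the averages are missing for the larger file. The following is only a sketch under that assumption, with a small hypothetical frame in place of the real CSV: it converts the column to numbers first and then takes the mean.

```python
import pandas as pd

# Hypothetical sample; in the question the data comes from a CSV file.
df = pd.DataFrame({
    "Age_Groups": ["18-24", "18-24", "25-34", "25-34", "35-44"],
    "Trip_in_min": ["10", "20", "30", "n/a", "5"],  # strings, not numbers
})

# Convert to numbers; anything that cannot be parsed becomes NaN.
df["Trip_in_min"] = pd.to_numeric(df["Trip_in_min"], errors="coerce")

# mean() skips NaN values by default.
avg = df.groupby("Age_Groups", as_index=False)["Trip_in_min"].mean()
print(avg)
```

Checking df["Trip_in_min"].isna().sum() after the conversion shows how many cells could not be parsed, which may point at the rows that break the larger file.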
I'm new to Python and programming, so this is no doubt a newbie question. I want to show the value counts for each unique value of each categorical variable in a data frame, but what I've written isn't working. I'm trying to avoid writing separate lines for each individual column if I can help it.
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 checking_balance 1000 non-null category
1 months_loan_duration 1000 non-null int64
2 credit_history 1000 non-null category
3 purpose 1000 non-null category
4 amount 1000 non-null int64
5 savings_balance 1000 non-null category
6 employment_duration 1000 non-null category
7 percent_of_income 1000 non-null int64
8 years_at_residence 1000 non-null int64
9 age 1000 non-null int64
10 other_credit 1000 non-null category
11 housing 1000 non-null category
12 existing_loans_count 1000 non-null int64
13 job 1000 non-null category
14 dependents 1000 non-null int64
15 phone 1000 non-null category
16 default 1000 non-null category
Code I've written:
for col in creditData.columns:
    if creditData[col].dtype == 'category':
        print(creditData[col].value_counts())
The results:
unknown 394
< 0 DM 274
1 - 200 DM 269
> 200 DM 63
Name: checking_balance, dtype: int64
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-6a53236835fc> in <module>
1 for col in creditData.columns: # Loop through all columns in the dataframe
----> 2 if creditData[col].dtype == 'category':
3 print(creditData[col].value_counts())
TypeError: data type 'category' not understood
This works for me:
for i in creditData.columns:
    if creditData[i].dtype != 'int64':
        print(creditData[i].value_counts())
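Filtering on "not int64" would also pick up float or object columns if the frame had any. As an alternative sketch (not the answerer's method), select_dtypes can pick the categorical columns directly and avoids comparing dtypes with strings. The tiny creditData frame below is hypothetical.

```python
import pandas as pd

# Hypothetical miniature of creditData with two categorical columns.
creditData = pd.DataFrame({
    "checking_balance": pd.Categorical(["unknown", "< 0 DM", "unknown"]),
    "amount": [1169, 5951, 2096],
    "housing": pd.Categorical(["own", "rent", "own"]),
})

# Keep only the columns whose dtype is categorical.
categorical = creditData.select_dtypes(include="category")

counts = {}
for col in categorical.columns:
    counts[col] = categorical[col].value_counts()
    print(counts[col])
    print()
```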
I'm using a Mac with Anaconda installed, and I work on the data in Jupyter Notebook 6.1.4. For learning purposes, I'm using the Kaggle SF Salaries dataset (https://www.kaggle.com/kaggle/sf-salaries).
After importing the file in Jupyter Notebook, the command df.info() shows the following:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 148654 non-null int64
1 EmployeeName 148654 non-null object
2 JobTitle 148654 non-null object
3 BasePay 148049 non-null object
4 OvertimePay 148654 non-null object
5 OtherPay 148654 non-null object
6 Benefits 112495 non-null object
7 TotalPay 148654 non-null float64
8 TotalPayBenefits 148654 non-null float64
9 Year 148654 non-null int64
10 Notes 0 non-null float64
11 Agency 148654 non-null object
12 Status 38119 non-null object
dtypes: float64(3), int64(2), object(8)
memory usage: 14.7+ MB.
In Colab, the same dataset shows different specifications:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116475 entries, 0 to 116474
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 116475 non-null int64
1 EmployeeName 116475 non-null object
2 JobTitle 116475 non-null object
3 BasePay 115870 non-null float64
4 OvertimePay 116474 non-null float64
5 OtherPay 116474 non-null float64
6 Benefits 80315 non-null float64
7 TotalPay 116474 non-null float64
8 TotalPayBenefits 116474 non-null float64
9 Year 116474 non-null float64
10 Notes 0 non-null float64
11 Agency 116474 non-null object
12 Status 5943 non-null object
dtypes: float64(8), int64(1), object(4)
memory usage: 11.6+ MB.
The dataset is a csv file. The csv format is a plain text format: one line per row (normally terminated with '\r\n'), each line containing fields separated by a delimiter (normally the comma ','), and optionally enclosed in quotes.
But there is no indication of the datatypes. Dumb tools (text editors or LibreOffice Calc) present the raw data to the user, so that the user may choose the datatypes, delimiters and encoding. Clever tools (Excel and in some sense Colab or Pandas) think that they can guess everything, either by assuming what they consider common or by applying heuristics. So it is no surprise that they end up with different guesses.
(If you have not guessed it, I hate Excel's handling of csv files, and only rely on Calc...)
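To take the guessing out of the picture, the types can be stated explicitly when reading the file. This is a minimal sketch: the CSV text and the "Not Provided" placeholder are made up for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text; one BasePay cell holds a non-numeric placeholder.
csv_text = "Id,BasePay\n1,100.5\n2,Not Provided\n3,250.0\n"

# Default inference: the placeholder keeps the column from being parsed as numbers.
inferred = pd.read_csv(io.StringIO(csv_text))

# Read everything as text, then convert each column deliberately.
explicit = pd.read_csv(io.StringIO(csv_text), dtype=str)
explicit["Id"] = explicit["Id"].astype("int64")
explicit["BasePay"] = pd.to_numeric(explicit["BasePay"], errors="coerce")

print(inferred.dtypes)
print(explicit.dtypes)
```

With the conversion spelled out like this, the resulting dtypes no longer depend on what a particular environment happens to infer.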
I'm having trouble merging two dataframes in pandas. They are parts of a dataset split between two files, and they share some columns and values, namely 'name' and 'address'. The entries with identical values do not share their index with entries in the other file. I tried variations of the following line:
res = pd.merge(df, df_p, on=['name', 'address'], how="left")
When the how argument was set to 'left', the columns from df_p had no values. 'right' had the opposite effect, with columns from df being empty. 'inner' resulted in an empty dataframe and 'outer' duplicated the number of entries, essentially just appending the results of 'left' and 'right'.
I manually verified that there are identical combinations of 'name' and 'address' values in both files.
Edit: An attempt at merging on just one of those columns appears to be successful; however, I want to avoid merging incorrect entries in case two people with identical names have different addresses, and vice versa.
Edit1: Here's some more information on the data-set.
df.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3983 entries, 0 to 3982
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 3983 non-null int64
1 name 3983 non-null object
2 address 3983 non-null object
3 race 3970 non-null object
4 marital-status 3967 non-null object
5 occupation 3971 non-null object
6 pregnant 3969 non-null object
7 education-num 3965 non-null float64
8 relationship 3968 non-null object
9 skewness_glucose 3972 non-null float64
10 mean_glucose 3572 non-null float64
11 capital-gain 3972 non-null float64
12 kurtosis_glucose 3970 non-null float64
13 education 3968 non-null object
14 fnlwgt 3968 non-null float64
15 class 3969 non-null float64
16 std_glucose 3965 non-null float64
17 income 3974 non-null object
18 medical_info 3968 non-null object
19 native-country 3711 non-null object
20 hours-per-week 3971 non-null float64
21 capital-loss 3969 non-null float64
22 workclass 3968 non-null object
dtypes: float64(10), int64(1), object(12)
memory usage: 715.8+ KB
example entry from df:
0,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201", White, Married-civ-spouse, Exec-managerial,f,9.0, Husband,1.904881822,79.484375,15024.0,0.667177618, HS-grad,147707.0,0.0,39.49544760000001, >50K,"{'mean_oxygen':'1.501672241','std_oxygen':'13.33605383','kurtosis_oxygen':'11.36579476','skewness_oxygen':'156.77910559999995'}", United-States,60.0,0.0, Private
df_p.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3933 entries, 0 to 3932
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 3933 non-null int64
1 name 3933 non-null object
2 address 3933 non-null object
3 age 3933 non-null int64
4 sex 3933 non-null object
5 date_of_birth 3933 non-null object
dtypes: int64(2), object(4)
memory usage: 184.5+ KB
sample entry from df_p:
2273,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201",44, Male,1975-03-26
As you can see, the chosen samples are for the same person, but their index does not match, which is why I tried using the name and address columns.
Edit2: Changing the order of df and df_p in the merge seems to have solved the issue, though I have no clue why.
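The symptoms described above (empty columns with 'left' and 'right', nothing with 'inner', doubled rows with 'outer') are what one would see if no key pair matched exactly. One way to investigate is sketched below, under the unverified assumption that the keys differ in invisible characters such as line endings or extra spaces. The two rows and the normalize helper are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the address differs only in its line ending.
df = pd.DataFrame({
    "name": ["Curtis Brown"],
    "address": ["32266 Byrd Island\r\nFowlertown, DC 84201"],
    "race": ["White"],
})
df_p = pd.DataFrame({
    "name": ["Curtis Brown"],
    "address": ["32266 Byrd Island\nFowlertown, DC 84201"],
    "age": [44],
})

keys = ["name", "address"]

# indicator=True adds a _merge column saying where each row came from.
raw = pd.merge(df, df_p, on=keys, how="outer", indicator=True)

def normalize(frame):
    """Return a copy with every run of whitespace in the key columns collapsed."""
    out = frame.copy()
    for col in keys:
        out[col] = out[col].str.split().str.join(" ")
    return out

clean = pd.merge(normalize(df), normalize(df_p), on=keys, how="outer", indicator=True)

print(raw["_merge"].tolist())
print(clean["_merge"].tolist())
```

If the _merge column of the raw merge shows only left_only and right_only, printing repr() of a key value from each frame makes the hidden difference visible.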
I have a DataFrame called result.
In the tutorial I'm watching, Wes McKinney gets the following output when he executes a cell with just the name of the df in it. When I execute a cell with result in it, I get the whole frame returned.
Is there a pandas set_option I can use to switch between these two kinds of output?
There is a display.large_repr option for that:
In [95]: pd.set_option('large_repr', 'info')
In [96]: df
Out[96]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
a 1000000 non-null int32
b 1000000 non-null int32
c 1000000 non-null int32
dtypes: int32(3)
memory usage: 11.4 MB
From the docs:
display.large_repr : 'truncate'/'info'
For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can show a truncated table (the default from 0.13), or switch to the view from df.info() (the behaviour in earlier versions of pandas).
[default: truncate] [currently: truncate]
PS You may also want to read about frequently used pandas options
But IMO it would be much more convenient and more explicit to call the .info() method directly:
result.info()
demo:
In [92]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
a 1000000 non-null int32
b 1000000 non-null int32
c 1000000 non-null int32
dtypes: int32(3)
memory usage: 11.4 MB
In [93]: df.head()
Out[93]:
a b c
0 1 0 1
1 6 1 9
2 5 2 3
3 6 4 3
4 8 9 2
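If the info-style repr is only wanted in a few places, the option can be set temporarily instead of globally. A small sketch, using pd.option_context with a made-up frame and an arbitrary max_rows threshold of 10:

```python
import pandas as pd

df = pd.DataFrame({"a": range(100), "b": range(100)})

# Temporarily treat frames longer than 10 rows as "large" and
# show the info view for them instead of a truncated table.
with pd.option_context("display.max_rows", 10, "display.large_repr", "info"):
    info_style = repr(df)

# Outside the block the previous settings apply again.
table_style = repr(df)

print(info_style)
```

The info view is only used when the frame exceeds max_rows or max_cols, which is why the sketch lowers max_rows as well.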
The df.info(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None) method will give you what you want (in pandas 1.2 and later, null_counts is named show_counts). By default, it displays the number of values and the size of the dataframe. The documentation is here. The verbose setting might be particularly useful for larger datasets, as it shows the full output, including the number of non-null values per column.
Default:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 10 entries, 0 to 9
dtypes: float64(10)
memory usage: 7.9 KB
With verbose = True:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 10 columns):
0 100 non-null float64
1 100 non-null float64
2 100 non-null float64
3 100 non-null float64
4 100 non-null float64
5 100 non-null float64
6 100 non-null float64
7 100 non-null float64
8 100 non-null float64
9 100 non-null float64
dtypes: float64(10)
memory usage: 7.9 KB
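The two outputs above can be produced and compared programmatically by passing a buffer to info(). This is a minimal sketch; the info_text helper is a hypothetical name, and the random frame only mirrors the 100 x 10 shape shown above.

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 10))

def info_text(frame, **kwargs):
    """Return the output of frame.info() as a string instead of printing it."""
    buf = io.StringIO()
    frame.info(buf=buf, **kwargs)
    return buf.getvalue()

short_text = info_text(df, verbose=False)  # summary line for the columns
full_text = info_text(df, verbose=True)    # one line per column

print(short_text)
print(full_text)
```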