Here are my original dataframe column types:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NAME 23605 non-null object
1 DEPARTMENT_NAME 23605 non-null object
2 TITLE 23605 non-null object
3 REGULAR 21939 non-null object
4 RETRO 13643 non-null object
5 OTHER 13351 non-null object
6 OVERTIME 6826 non-null object
7 INJURED 1312 non-null object
8 DETAIL 2355 non-null object
9 QUINN/EDUCATION INCENTIVE 1351 non-null object
10 TOTAL EARNINGS 23605 non-null object
11 POSTAL 23605 non-null object
I want to convert some of them to float, say TOTAL EARNINGS. I tried:
df['TOTAL EARNINGS'] = df['TOTAL EARNINGS'].astype(int)
and
df['TOTAL EARNINGS'] = pd.to_numeric(df['TOTAL EARNINGS'])
But I got:
ValueError: setting an array element with a sequence.
or
TypeError: Invalid object type at position 0
I don't know why this happens. Are there any other methods to do this?
Here is my data: https://data.boston.gov/dataset/418983dc-7cae-42bb-88e4-d56f5adcf869/resource/31358fd1-849a-48e0-8285-e813f6efbdf1/download/employeeearningscy18full.csv
This happens because your original data contains 2 rows that are entirely text (the string "TOTAL EARNINGS" appears as a value in that column), so the column cannot be converted to numbers.
First, execute the command below to remove those rows:
df = df[df["TOTAL EARNINGS"]!="TOTAL EARNINGS"]
Then, change the datatype:
df['TOTAL EARNINGS'] = df['TOTAL EARNINGS'].astype(float)
You can check the datatypes afterwards with:
df.dtypes
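If the other pay columns need the same treatment, a minimal sketch (assuming they also hold plain numeric strings, with the stray header rows already removed) could be:
import pandas as pd

# Hypothetical list of the pay columns to convert; errors='coerce' turns
# anything that still isn't numeric into NaN instead of raising.
pay_cols = ['REGULAR', 'RETRO', 'OTHER', 'OVERTIME', 'INJURED',
            'DETAIL', 'QUINN/EDUCATION INCENTIVE', 'TOTAL EARNINGS']
for col in pay_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')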
I am trying to read in a zipped text file from EIA. I have been able to get the file downloaded, unzipped, and converted to a string that I believe is JSON formatted, but I cannot seem to convert it into a DataFrame. Help is greatly appreciated.
import pandas as pd
import requests
import io
import zipfile
import json
url_data='https://api.eia.gov/bulk/PET.zip'
r = requests.get(url_data)
with zipfile.ZipFile(io.BytesIO(r.content), mode="r") as archive:
    archive.printdir()
    text = archive.read("PET.txt").decode(encoding="utf-8")
To read this file, use:
import pandas as pd
df = pd.read_json(path_to_zip, lines=True)
df contains all the rows:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188297 entries, 0 to 188296
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 series_id 174220 non-null object
1 name 188297 non-null object
2 units 174220 non-null object
3 f 174220 non-null object
4 unitsshort 174220 non-null object
5 description 174220 non-null object
6 copyright 174220 non-null object
7 source 174220 non-null object
8 iso3166 134979 non-null object
9 geography 161968 non-null object
10 start 174220 non-null float64
11 end 174220 non-null float64
12 last_updated 174220 non-null object
13 data 174220 non-null object
14 geography2 105177 non-null object
15 category_id 14077 non-null float64
16 parent_category_id 14077 non-null float64
17 notes 14077 non-null object
18 childseries 14077 non-null object
read_json can already read compressed JSON files. This isn't a single JSON document, though; it contains one JSON document per line. You can read such files with the lines parameter.
In a JSON document there can be only one root, either an object or an array. This means the entire document must be read into memory before it can be parsed. This causes severe problems with large files like this one, or when an application wants to append JSON documents (e.g. records) to an existing file. The entire file would have to be read and written at once.
To overcome this, it's common to store one unindented JSON document per line. This way, to add a new document all the code has to do is append a new line. To read a subset of the lines, an application only needs to seek to the first newline after an offset and read the next N lines.
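For illustration, appending a record then only requires writing one more line; a minimal sketch with a hypothetical record and file name:
import json

record = {"series_id": "EXAMPLE", "name": "example series"}  # hypothetical record

# Each document sits on its own line, so adding one is a single appended line.
with open("data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")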
read_json can read a subset of such files (when lines=True) through the nrows parameter:
>>> df2=pd.read_json(r"C:\Users\pankan\Downloads\PET.zip",lines=True,nrows=100)
>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 series_id 100 non-null object
1 name 100 non-null object
2 units 100 non-null object
3 f 100 non-null object
4 unitsshort 100 non-null object
5 description 100 non-null object
6 copyright 100 non-null object
7 source 100 non-null object
8 iso3166 67 non-null object
9 geography 100 non-null object
10 start 100 non-null int64
11 end 100 non-null int64
12 last_updated 100 non-null object
13 data 100 non-null object
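If the file is too large to load at once, read_json with lines=True also accepts a chunksize parameter, which returns an iterator of smaller DataFrames; a sketch reusing the same path (assuming a recent pandas version):
import pandas as pd

# Process the file 10,000 lines at a time instead of loading everything into memory.
reader = pd.read_json(r"C:\Users\pankan\Downloads\PET.zip", lines=True, chunksize=10_000)
for chunk in reader:
    print(chunk.shape)  # replace with real per-chunk processing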
Create a DataFrame, print info on it, append a row, and print info again. The dtype of all the columns changes to object. Why?
import numpy as np
import pandas as pd

myData = np.array([134.29, 136.97, 250.31, 312.28])
mySeries = pd.Series(myData,index=['IBM','P&G','Microsoft','Home Depot'], name="Stock Price")
myData1 = np.array(['120.573B', '336.72B', '1.885T' , '335.974B'])
mySeries1 = pd.Series(myData1, index=['IBM','P&G','Microsoft','Home Depot'], name="Market Cap")
myData2 = np.array([120_573_000_000, 336_720_000_000, 1_885_000_000_000 , 335_974_000_000])
mySeries2 = pd.Series(myData2, index=['IBM','P&G','Microsoft','Home Depot'], name="Market Cap Raw")
myDataFrame = pd.concat([mySeries, mySeries1, mySeries2], axis=1)
#print(myDataFrame)
print(myDataFrame.info())
# After adding the row below, the dtypes of the numeric columns change to object
myData = np.array([20.99, '100M', 100000000 ])
mySeries = pd.Series(myData, index = myDataFrame.columns, name = 'HML')
myDataFrame = myDataFrame.append(mySeries, ignore_index=False)
#print(myDataFrame)
print(myDataFrame.info())
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, IBM to Home Depot
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Stock Price 4 non-null float64
1 Market Cap 4 non-null object
2 Market Cap Raw 4 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 128.0+ bytes
None
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, IBM to HML
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Stock Price 5 non-null object
1 Market Cap 5 non-null object
2 Market Cap Raw 5 non-null object
dtypes: object(3)
memory usage: 160.0+ bytes
None
When you create a Series object containing objects of different incompatible types, the dtype of that Series becomes object.
When you create myData and mySeries the second time, that's exactly what's happening:
>>> myData = np.array([20.99, '100M', 100000000 ])
>>> mySeries = pd.Series(myData, index = myDataFrame.columns, name = 'HML')
>>> mySeries.dtype
dtype('O')
Right after that, you append that Series (of dtype object) to the dataframe. Since the object type is more general than the dtypes of the various columns of the dataframe, those columns get converted to the more general object dtype.
I figured out how to fix it:
tmpSeries = pd.to_numeric(myDataFrame['Stock Price'])
myDataFrame['Stock Price'] = tmpSeries
This changes the column from object back to float64. to_numeric can also be used to convert to other numeric types.
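A sketch applying the same fix to the other column that was silently widened (the appended value there is parseable as an integer, so the column should come back as int64):
import pandas as pd

# Restore the integer dtype that was lost when the object-dtype row was appended.
myDataFrame['Market Cap Raw'] = pd.to_numeric(myDataFrame['Market Cap Raw'])
print(myDataFrame.dtypes)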
I am trying to create a new pandas dataframe displayDF with 4 columns from the dataframe finalDF.
displayDF = finalDF[['False','True','RULE ID','RULE NAME']]
This command is failing with the error:
KeyError: "['False', 'True'] not in index"
However, I can see the columns "False" and "True" when I run finalDF.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 11
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 rule_rec_id 12 non-null object
1 False 12 non-null int64
2 True 12 non-null int64
3 RULE ID 12 non-null object
4 RULE NAME 12 non-null object
5 RULE DESCRIPTION 12 non-null object
dtypes: int64(2), object(4)
memory usage: 672.0+ bytes
Additional Background:
I created finalDF by merging two dataframes (pivot_stackedPandasDF and dfPandaDescriptions)
finalDF = pd.merge(pivot_stackedPandasDF, dfPandaDescriptions, how='left', left_on=['rule_rec_id'], right_on=['RULE ID'])
I created pivot_stackedPandasDF with this command.
pivot_stackedPandasDF = stackedPandasDF.pivot_table(index="rule_rec_id", columns="alert_value", values="count").reset_index()
I think the root cause may be in the way I ran the .pivot_table() command.
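One way to test that idea is to inspect the actual column labels: if the pivot produced the booleans False/True rather than the strings 'False'/'True', renaming them would make the selection work. A sketch (the rename is a hypothetical fix, only valid if the labels really are booleans):
# Check whether the labels are strings or booleans.
print([(col, type(col)) for col in finalDF.columns])

# Hypothetical fix if the labels turn out to be booleans.
finalDF = finalDF.rename(columns={False: 'False', True: 'True'})
displayDF = finalDF[['False', 'True', 'RULE ID', 'RULE NAME']]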
I'm using a Mac, on which I installed Anaconda, and I use Jupyter Notebook 6.1.4 to work on data. For learning purposes, I'm using the Kaggle SF Salaries dataset (https://www.kaggle.com/kaggle/sf-salaries).
After importing the file in Jupyter Notebook and running df.info(), it shows these specifications:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 148654 non-null int64
1 EmployeeName 148654 non-null object
2 JobTitle 148654 non-null object
3 BasePay 148049 non-null object
4 OvertimePay 148654 non-null object
5 OtherPay 148654 non-null object
6 Benefits 112495 non-null object
7 TotalPay 148654 non-null float64
8 TotalPayBenefits 148654 non-null float64
9 Year 148654 non-null int64
10 Notes 0 non-null float64
11 Agency 148654 non-null object
12 Status 38119 non-null object
dtypes: float64(3), int64(2), object(8)
memory usage: 14.7+ MB.
In the Colab environment, the same dataset shows different specifications:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116475 entries, 0 to 116474
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 116475 non-null int64
1 EmployeeName 116475 non-null object
2 JobTitle 116475 non-null object
3 BasePay 115870 non-null float64
4 OvertimePay 116474 non-null float64
5 OtherPay 116474 non-null float64
6 Benefits 80315 non-null float64
7 TotalPay 116474 non-null float64
8 TotalPayBenefits 116474 non-null float64
9 Year 116474 non-null float64
10 Notes 0 non-null float64
11 Agency 116474 non-null object
12 Status 5943 non-null object
dtypes: float64(8), int64(1), object(4)
memory usage: 11.6+ MB.
The dataset is a csv file. The csv format is a plain text format: one line per row (normally delimited with '\r\n'), each line containing fields separated by a delimiter (normally the comma ','), optionally enclosed in quotes.
But there is no indication of the datatypes. Dumb tools (text editors or LibreOffice Calc) present the raw data to the user, so that the user may choose the datatypes, delimiters and encoding. Clever tools (Excel and, in some sense, Colab or pandas) think that they can guess everything, either because they decide based on what they think is common or with some heuristics. So there is no surprise that they end up with different guesses.
(If you have not guessed it, I hate Excel's handling of csv files, and only rely on Calc...)
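If consistent dtypes across environments matter, one option is to bypass the guessing and tell read_csv the types explicitly; a sketch with hypothetical choices (file name and column list are assumptions) for this dataset:
import pandas as pd

# Pin the ambiguous pay columns to str so both Jupyter and Colab load them identically;
# numeric conversion can then be done deliberately with pd.to_numeric.
df = pd.read_csv(
    "Salaries.csv",  # hypothetical local path to the Kaggle file
    dtype={"BasePay": str, "OvertimePay": str, "OtherPay": str, "Benefits": str},
)
print(df.dtypes)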
I'm having trouble merging two dataframes in pandas. They are parts of a dataset split between two files, and they share some columns and values, namely 'name' and 'address'. The entries with identical values do not share their index with entries in the other file. I tried variations of the following line:
res = pd.merge(df, df_p, on=['name', 'address'], how="left")
When the how argument was set to 'left', the columns from df_p had no values. 'right' had the opposite effect, with columns from df being empty. 'inner' resulted in an empty dataframe and 'outer' duplicated the number of entries, essentially just appending the results of 'left' and 'right'.
I manually verified that there are identical combinations of 'name' and 'address' values in both files.
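One way to see how many key combinations actually line up is merge's indicator parameter, ideally after normalizing the keys; a sketch (the whitespace stripping is an assumption about the data):
import pandas as pd

# Strip stray whitespace from the keys, since invisible differences break equality.
for frame in (df, df_p):
    frame['name'] = frame['name'].str.strip()
    frame['address'] = frame['address'].str.strip()

# indicator=True adds a '_merge' column showing whether each row matched both sides.
check = pd.merge(df, df_p, on=['name', 'address'], how='outer', indicator=True)
print(check['_merge'].value_counts())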
Edit: Merging on just one of those columns appears to be successful; however, I want to avoid merging incorrect entries in case two people with identical names have different addresses, and vice versa.
Edit1: Here's some more information on the dataset.
df.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3983 entries, 0 to 3982
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 3983 non-null int64
1 name 3983 non-null object
2 address 3983 non-null object
3 race 3970 non-null object
4 marital-status 3967 non-null object
5 occupation 3971 non-null object
6 pregnant 3969 non-null object
7 education-num 3965 non-null float64
8 relationship 3968 non-null object
9 skewness_glucose 3972 non-null float64
10 mean_glucose 3572 non-null float64
11 capital-gain 3972 non-null float64
12 kurtosis_glucose 3970 non-null float64
13 education 3968 non-null object
14 fnlwgt 3968 non-null float64
15 class 3969 non-null float64
16 std_glucose 3965 non-null float64
17 income 3974 non-null object
18 medical_info 3968 non-null object
19 native-country 3711 non-null object
20 hours-per-week 3971 non-null float64
21 capital-loss 3969 non-null float64
22 workclass 3968 non-null object
dtypes: float64(10), int64(1), object(12)
memory usage: 715.8+ KB
example entry from df:
0,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201", White, Married-civ-spouse, Exec-managerial,f,9.0, Husband,1.904881822,79.484375,15024.0,0.667177618, HS-grad,147707.0,0.0,39.49544760000001, >50K,"{'mean_oxygen':'1.501672241','std_oxygen':'13.33605383','kurtosis_oxygen':'11.36579476','skewness_oxygen':'156.77910559999995'}", United-States,60.0,0.0, Private
df_p.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3933 entries, 0 to 3932
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 3933 non-null int64
1 name 3933 non-null object
2 address 3933 non-null object
3 age 3933 non-null int64
4 sex 3933 non-null object
5 date_of_birth 3933 non-null object
dtypes: int64(2), object(4)
memory usage: 184.5+ KB
sample entry from df_p:
2273,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201",44, Male,1975-03-26
As you can see, the chosen samples are for the same person, but their index does not match, which is why I tried using the name and address columns.
Edit2: Changing the order of df and df_p in the merge seems to have solved the issue, though I have no clue why.