How to empty string in pandas [duplicate] - python

This question already has answers here:
Replacing blank values (white space) with NaN in pandas
(13 answers)
Closed 2 years ago.
So, I've been working with pandas in Python on data extracted from an external system, and every column has lots of trailing spaces. My idea was to apply the str.strip() method to each Series with this code:
Data["DESCRIPTION"] = Data["DESCRIPTION"].str.strip()
It basically did its job, but when I check the properties of the DataFrame with .info() I run into an issue: if a value contained only spaces and no text, it is now empty, but it is not converted to null:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18028 entries, 0 to 18027
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 VIN 18028 non-null object
1 DESCRIPTION 18028 non-null object
2 DESCRIPTION 2 18028 non-null object
3 ENGINE 18023 non-null object
4 TRANSMISSION 18028 non-null object
5 PAINT 18028 non-null object
6 EXT_COLOR_CODE 18028 non-null object
7 EXT_COLOR_DESC 18028 non-null object
8 INT_COLOR_DESC 18028 non-null object
9 COUNTRY 18028 non-null object
10 PROD_DATE 18028 non-null object
dtypes: object(11)
memory usage: 1.5+ MB
However, checking the condition that the string is empty:
Data['DESCRIPTION 2'] == ""
0 True
1 True
2 True
3 True
4 True
...
18023 True
18024 True
18025 True
18026 True
18027 True
Name: DESCRIPTION 2, Length: 18028, dtype: bool
How could I convert all of those to null so that I could drop them with the dropna() function?
I'd be grateful for any suggestions.

To remove trailing spaces and replace empty strings or whitespace-only records with NaN, run the commands below:
import numpy as np
Data["DESCRIPTION"] = Data["DESCRIPTION"].str.strip().replace(r'^\s*$', np.nan, regex=True)
Please refer to this page Replacing blank values (white space) with NaN in pandas
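If you want to apply the same cleanup to every text column and then drop the empty rows, a minimal sketch (assuming the DataFrame is called Data, as in the question, and that every object column should be treated this way) could be:
import numpy as np
# strip whitespace and turn whitespace-only values into NaN in every object column
for col in Data.select_dtypes(include="object").columns:
    Data[col] = Data[col].str.strip().replace(r'^\s*$', np.nan, regex=True)
# drop rows where, for example, DESCRIPTION 2 ended up as NaN
Data = Data.dropna(subset=["DESCRIPTION 2"])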

Related

Why am I getting an empty index?

All this is asking me to do is write code that shows whether there are any missing values where it is not the customer's first order. I have provided the DataFrame info. Should I use the column 'order_number' instead? Is my code wrong?
I named the DataFrame df_orders.
I thought my code would find the rows that have missing values and an order number greater than 1.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478967 entries, 0 to 478966
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 478967 non-null int64
1 user_id 478967 non-null int64
2 order_number 478967 non-null int64
3 order_dow 478967 non-null int64
4 order_hour_of_day 478967 non-null int64
5 days_since_prior_order 450148 non-null float64
dtypes: float64(1), int64(5)
memory usage: 21.9 MB
None
# Are there any missing values where it's not a customer's first order?
m_v_fo= df_orders[df_orders['days_since_prior_order'].isna() > 1]
print(m_v_fo.head())
Empty DataFrame
Columns: [order_id, user_id, order_number, order_dow, order_hour_of_day,
days_since_prior_order]
Index: []
When you call .isna() you get back a Series of True/False values, so an individual value will never be > 1.
Instead, try this:
m_v_fo= df_orders[df_orders['days_since_prior_order'].isna().sum() > 1]
If that doesn't solve the problem, then I'm not sure - try editing your question to add more detail and I can try again. :)
Update: I read your question again, and I think you're doing this out of order. First you need to filter on days_since_prior_order and then look for na.
m_v_fo = df_orders[df_orders['days_since_prior_order'] > 1].isna()
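For what it's worth, if the goal is literally "missing values where it is not the customer's first order", a minimal sketch (assuming order_number == 1 marks a customer's first order; that assumption comes from the question, not from the answer above) would be:
# rows that are not a first order but have no days_since_prior_order recorded
m_v_fo = df_orders[(df_orders['order_number'] > 1) & (df_orders['days_since_prior_order'].isna())]
print(m_v_fo.head())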

Read in nested JSON from zipfile via URL [Python]

I am trying to read in a text file from EIA that is zipped. I have been able to get the file downloaded, unzipped, and converted to a string that is, I believe, JSON formatted, but I cannot seem to convert it into a DataFrame. Help is greatly appreciated.
import pandas as pd
import requests
import io
import zipfile
import json
url_data='https://api.eia.gov/bulk/PET.zip'
r = requests.get(url_data)
with zipfile.ZipFile(io.BytesIO(r.content), mode="r") as archive:
    archive.printdir()
    text = archive.read("PET.txt").decode(encoding="utf-8")
To read this file, use:
import pandas as pd
df = pd.read_json(path_to_zip, lines=True)
df contains all rows
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188297 entries, 0 to 188296
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 series_id 174220 non-null object
1 name 188297 non-null object
2 units 174220 non-null object
3 f 174220 non-null object
4 unitsshort 174220 non-null object
5 description 174220 non-null object
6 copyright 174220 non-null object
7 source 174220 non-null object
8 iso3166 134979 non-null object
9 geography 161968 non-null object
10 start 174220 non-null float64
11 end 174220 non-null float64
12 last_updated 174220 non-null object
13 data 174220 non-null object
14 geography2 105177 non-null object
15 category_id 14077 non-null float64
16 parent_category_id 14077 non-null float64
17 notes 14077 non-null object
18 childseries 14077 non-null object
read_json can already read compressed JSON files. This isn't a single JSON document, though; it contains one JSON document per line. You can read such files with the lines parameter.
In a JSON document there can be only one root, either an object or an array. This means the entire document must be read into memory before it can be parsed. That causes severe problems with large files like this one, or when an application wants to append JSON documents (e.g. records) to an existing file: the entire file would have to be read and rewritten at once.
To overcome this, it's common to store one unindented JSON document per line. This way, to add a new document all the code has to do is append a new line. To read a subset of the lines, an application only needs to seek to the first newline after an offset and read the next N lines.
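For illustration only (these two records are made up, not taken from the PET file), a line-delimited JSON file looks like this, with each line being a complete, self-contained document:
{"series_id": "X1", "name": "first record", "data": [[2020, 1.0]]}
{"series_id": "X2", "name": "second record", "data": [[2021, 2.0]]}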
read_json can read a subset of such files when lines=True, through the nrows parameter:
>>> df2=pd.read_json(r"C:\Users\pankan\Downloads\PET.zip",lines=True,nrows=100)
>>> df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 series_id 100 non-null object
1 name 100 non-null object
2 units 100 non-null object
3 f 100 non-null object
4 unitsshort 100 non-null object
5 description 100 non-null object
6 copyright 100 non-null object
7 source 100 non-null object
8 iso3166 67 non-null object
9 geography 100 non-null object
10 start 100 non-null int64
11 end 100 non-null int64
12 last_updated 100 non-null object
13 data 100 non-null object
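If you would rather keep the requests/zipfile approach from the question instead of saving the zip to disk, a small sketch that reuses the text variable already decoded above would be:
import io
import pandas as pd
# text holds the decoded contents of PET.txt, one JSON document per line
df = pd.read_json(io.StringIO(text), lines=True)
print(df.shape)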

Why Are Some Columns "Not In Index" When Creating a New Dataframe?

I am trying to create a new pandas dataframe displayDF with 4 columns from the dataframe finalDF.
displayDF = finalDF[['False','True','RULE ID','RULE NAME']]
This command is failing with the error:
KeyError: "['False', 'True'] not in index"
However, I can see the columns "False" and "True" when I run finalDF.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 11
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 rule_rec_id 12 non-null object
1 False 12 non-null int64
2 True 12 non-null int64
3 RULE ID 12 non-null object
4 RULE NAME 12 non-null object
5 RULE DESCRIPTION 12 non-null object
dtypes: int64(2), object(4)
memory usage: 672.0+ bytes
Additional Background:
I created finalDF by merging two dataframes (pivot_stackedPandasDF and dfPandaDescriptions)
finalDF = pd.merge(pivot_stackedPandasDF, dfPandaDescriptions, how='left', left_on=['rule_rec_id'], right_on=['RULE ID'])
I created pivot_stackedPandasDF with this command.
pivot_stackedPandasDF = stackedPandasDF.pivot_table(index="rule_rec_id", columns="alert_value", values="count").reset_index()
I think the root cause may be in the way I ran the .pivot_table() command.
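One thing worth checking, purely as a guess based on that pivot: when alert_value holds booleans, the pivoted column labels may be the booleans True and False rather than the strings 'True' and 'False', and selecting them by string would then raise exactly this KeyError. A quick way to see what the labels really are:
print(finalDF.columns.tolist())
# if they turn out to be booleans, select them as booleans (or rename them first), e.g.
# displayDF = finalDF[[False, True, 'RULE ID', 'RULE NAME']]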

Identify the column name on the basis of column data in .CSV file

I have a list of possible column names for a CSV file, like [email, null, password, ip_address, user_name, phone_no]. Consider a CSV with this data:
03-Sep-14,foo2#yahoo.co.jp,,
20-Jan-13,foo3#gmail.com,,
20-Feb-15,foo4#yahoo.co.jp,,
12-May-16,foo5#hotmail.co.jp,,
25-May-16,foo6#hotmail.co.jp,,
Now I want to identify the column names of this CSV file on the basis of the data, e.g. col_1 is a date and col_2 is an email.
I tried to use pandas, like getting all values from col_1 and then identifying whether it is an email or something else, but couldn't get very far.
I tried something like this:
df = pd.read_csv('demo.csv', header=None)
df[df[1].str.contains("#")]
but it's not helping me.
Thank you.
Have you tried using Pandas dataframe.infer_objects()?
# importing pandas as pd
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A": ["alpha", 15, 81, 1, 100],
                   "B": [2, 78, 7, 4, 12],
                   "C": ["beta", 21, 14, 61, 5]})
# data frame info and data
df.info()
print(df)
# slice all rows except first into a new frame
df_temp = df[1:]
# print it
print(df_temp)
df_temp.info()
# infer the object types
df_inferred = df_temp.infer_objects()
# print inferred
print(df_inferred)
df_inferred.info()
Here's the output from the above py script.
Initially df is inferred as object, int64 and object for A, B and C respectively.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 5 non-null object
1 B 5 non-null int64
2 C 5 non-null object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes
A B C
0 alpha 2 beta
1 15 78 21
2 81 7 14
3 1 4 61
4 100 12 5
A B C
1 15 78 21
2 81 7 14
3 1 4 61
4 100 12 5
After removing the first row, which has the exceptional string values, the data frame still shows the same types.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 1 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null object
1 B 4 non-null int64
2 C 4 non-null object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
A B C
1 15 78 21
2 81 7 14
3 1 4 61
4 100 12 5
After infer_objects(), the types have been correctly inferred as int64.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 1 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null int64
1 B 4 non-null int64
2 C 4 non-null int64
dtypes: int64(3)
memory usage: 228.0 bytes
Is this what you need?
The OP has clarified that s/he needs to determine if the column contains one of the following:
email
password
ip_address
user_name
phone_no
null
There are a couple of approaches we could use:
Approach #1: Take a random sample of rows and analyze their column contents using heuristics
We could use the following heuristic rules to identify column content type.
email: Use a regex to check for presence of a valid email.
[Stackoverflow - How to validate an email address]
https://www.regular-expressions.info/email.html
https://emailregex.com/
ip_address: Use a regex to match an ip_address.
Stackoverflow - Validating IPv4 addresses with Regex
Stackoverflow - Regular expression that matches valid IPv6 addresses
username: Use a table of common first names or last names and search for them within the username
phone_no: Strip +, SPACE, -, (, ) -- alternatively, all special characters. If you are left with all digits, we have a potential phone number
null: All column contents in sample are null
password: If it doesn't satisfy rules 1 through 5, we identify it as password
We should do the analysis independently on each column and keep track of how many sample items in the column matched each heuristic. Then we could pick the classification with the maximum number of matches.
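A minimal sketch of Approach #1 (the regexes below are simplified illustrations, not production-grade validators, and the threshold is arbitrary):
import re
import pandas as pd

# rough per-type patterns; tighten these for real use
PATTERNS = {
    "email": r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
    "ip_address": r"^(\d{1,3}\.){3}\d{1,3}$",
    "phone_no": r"^[\d\s()+-]{7,}$",
}

def guess_column_type(series, threshold=0.8):
    values = series.dropna().astype(str)
    if values.empty:
        return "null"
    # try dates first via pandas' parser
    if pd.to_datetime(values, errors="coerce").notna().mean() >= threshold:
        return "date"
    # score each heuristic by the fraction of sampled values it matches
    scores = {name: values.str.match(pat).mean() for name, pat in PATTERNS.items()}
    best, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best if best_score >= threshold else "unknown"

df = pd.read_csv("demo.csv", header=None)
print({col: guess_column_type(df[col]) for col in df.columns})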
Approach #2: Train a classifier using training data (obtained from real system) and use it to determine column content type
This is a machine learning classification task. A naive approach would be to take each column's data mapped to the content type as the training input.
Using the OP's sample set:
03-Sep-14,foo2#yahoo.co.jp,,
20-Jan-13,foo3#gmail.com,,
20-Feb-15,foo4#yahoo.co.jp,,
12-May-16,foo5#hotmail.co.jp,,
25-May-16,foo6#hotmail.co.jp,,
We would have:
data_content, content_type
03-Sep-14, date
20-Jan-13, date
20-Feb-15, date
12-May-16, date
25-May-16, date
foo2#yahoo.co.jp, email
foo3#gmail.com, email
foo4#yahoo.co.jp, email
foo5#hotmail.co.jp, email
foo6#hotmail.co.jp, email
We can then use machine learning to build a text-to-class multi-class classifier. Some references are given below:
Multi-Class Text Classification from Start to Finish
Multi-Class Text Classification with Scikit-Learn
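A very small sketch of Approach #2 with scikit-learn (the training rows simply mirror the OP's sample above; a real classifier would need far more labelled data per class):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_values = ["03-Sep-14", "20-Jan-13", "20-Feb-15", "12-May-16", "25-May-16",
                "foo2#yahoo.co.jp", "foo3#gmail.com", "foo4#yahoo.co.jp",
                "foo5#hotmail.co.jp", "foo6#hotmail.co.jp"]
train_labels = ["date"] * 5 + ["email"] * 5

# character n-grams suit short strings like these better than word tokens
model = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                      LogisticRegression(max_iter=1000))
model.fit(train_values, train_labels)
print(model.predict(["01-Jan-20", "someone#example.com"]))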

How can I convert a column of dataframe from object to float

Here are my original DataFrame column types:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NAME 23605 non-null object
1 DEPARTMENT_NAME 23605 non-null object
2 TITLE 23605 non-null object
3 REGULAR 21939 non-null object
4 RETRO 13643 non-null object
5 OTHER 13351 non-null object
6 OVERTIME 6826 non-null object
7 INJURED 1312 non-null object
8 DETAIL 2355 non-null object
9 QUINN/EDUCATION INCENTIVE 1351 non-null object
10 TOTAL EARNINGS 23605 non-null object
11 POSTAL 23605 non-null object
I want to convert some of them to float type, say TOTAL EARNINGS. I tried:
df['TOTAL EARNINGS'] = df['TOTAL EARNINGS'].astype(int)
and
df['TOTAL EARNINGS'] = pd.to_numeric(df['TOTAL EARNINGS'])
But I got:
ValueError: setting an array element with a sequence.
or
TypeError: Invalid object type at position 0
And I don't know why. Are there any other methods to do this?
Here is my data: https://data.boston.gov/dataset/418983dc-7cae-42bb-88e4-d56f5adcf869/resource/31358fd1-849a-48e0-8285-e813f6efbdf1/download/employeeearningscy18full.csv
This happens because your original data has 2 rows which are completely text.
First, execute the command below to remove those rows.
df = df[df["TOTAL EARNINGS"]!="TOTAL EARNINGS"]
Then, change the datatype
df['TOTAL EARNINGS'] = df['TOTAL EARNINGS'].astype(float)
You can check the datatypes afterwards with:
df.dtypes
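Putting it together, a sketch of the whole conversion (assuming, as above, that the only problem rows are the ones repeating the header text; if the earnings values also contain commas or dollar signs, strip those first as in the commented line):
import pandas as pd

df = pd.read_csv("employeeearningscy18full.csv")

# drop the stray rows that repeat the column headers as data
df = df[df["TOTAL EARNINGS"] != "TOTAL EARNINGS"]

# optional: remove thousands separators / currency symbols before converting
# df["TOTAL EARNINGS"] = df["TOTAL EARNINGS"].str.replace(r"[$,]", "", regex=True)

df["TOTAL EARNINGS"] = pd.to_numeric(df["TOTAL EARNINGS"], errors="coerce")
print(df.dtypes)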
