simple question here -- how do I replace all of the whitespaces in a column with a zero?
For example:
Name Age
John 12
Mary
Tim 15
into
Name Age
John 12
Mary 0
Tim 15
I've been trying using something like this but I am unsure how Pandas actually reads whitespace:
merged['Age'].replace(" ", 0).bfill()
Any ideas?
merged['Age'] = merged['Age'].apply(lambda x: 0 if x == ' ' else x)
Use the built in method convert_objects and set param convert_numeric=True:
In [12]:
# convert objects will handle multiple whitespace, this will convert them to NaN
# we then call fillna to convert those to 0
df.Age = df[['Age']].convert_objects(convert_numeric=True).fillna(0)
df
Out[12]:
Name Age
0 John 12
1 Mary 0
2 Tim 15
Here's an answer modified from this, more thorough question. I'll make it a little bit more Pythonic and resolve your basestring issue.
def ws_to_zero(maybe_ws):
try:
if maybe_ws.isspace():
return 0
else:
return maybe_ws
except AttributeError:
return maybe_ws
d.applymap(ws_to_zero)
where d is your dataframe.
if you want to use NumPy, then you can use the below snippet:
import numpy as np
df['column_of_interest'] = np.where(df['column_of_interest']==' ',0,df['column_of_interest']).astype(float)
While Paulo's response is excellent, my snippet above may be useful when multiple criteria are required during advanced data manipulation.
Related
I would like to go through every row (entry) in my df and remove every entry that has the value of " " (which yes is an empty string).
So if my data set is:
Name Gender Age
Jack 5
Anna F 6
Carl M 7
Jake M 7
Therefore Jack would be removed from the dataset.
On another note, I would also like to remove entries that has the value "Unspecified" and "Undetermined" as well.
Eg:
Name Gender Age Address
Jack 5 *address*
Anna F 6 *address*
Carl M 7 Undetermined
Jake M 7 Unspecified
Now,
Jack will be removed due to empty field.
Carl will be removed due to the value Undetermined present in a column.
Jake will be removed due to the value Unspecified present in a column.
For now, this has been my approach but I keep getting a TypeError.
list = []
for i in df.columns:
if df[i] == "":
# everytime there is an empty string, add 1 to list
list.append(1)
# count list to see how many entries there are with empty string
len(list)
Please help me with this. I would prefer a for loop being used due to there being about 22 columns and 9000+ rows in my actual dataset.
Note - I do understand that there are other questions asked like this, its just that none of them apply to my situation, meaning that most of them are only useful for a few columns and I do not wish to hardcode all 22 columns.
Edit - Thank you for all your feedbacks, you all have been incredibly helpful.
To delete a row based on a condition use the following:
df = df.drop(df[condition].index)
For example:
df = df.drop(df[Age==5].index) , will drop the row where the Age is 5.
I've come across a post regarding the same dating back to 2017, it should help you understand it more clearer.
Regarding question 2, here's how to remove rows with the specified values in a given column:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Let's assume we have a Pandas DataFrame object df.
To remove every row given your conditions, simply do:
df = df[df.Gender == " " or df.df.Age == " " or df.Address in [" ", "Undetermined", "Unspecified"]]
If the unspecified fields are NaN, you can also do:
df = df.dropna(how="any", axis = 0)
Answer from #ThatCSFresher or #Bence will help you out in removing rows based on single column... Which is great!
However, I think there are multiple condition in your query needed to check across multiple columns at once in a loop. So, probably apply-lambda can do the job; Try the following code;
df = pd.DataFrame({"Name":["Jack","Anna","Carl","Jake"],
"Gender":["","F","M","M"],
"Age":[5,6,7,7],
"Address":["address","address","Undetermined","Unspecified"]})
df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise",axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]
# Output of df;
Name Gender Age Address Noise_Tag
0 Jack 5 address Noise
1 Anna F 6 address No Noise
2 Carl M 7 Undetermined Noise
3 Jake M 7 Unspecified Noise
# Output of df1;
Name Gender Age Address
1 Anna F 6 address
Well, OP actually wants to delete any column with "empty" string.
df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.
If you want to delete specifically for address column, then you can just delete using
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Or if any column with Undetermined or Unspecified, try similar as the first solution in my post, just by replacing the empty string with Undertermined or Unspecified.
df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]
You can build masks and then filter the df according to it:
m1 = df.eq('').any(axis=1)
# m1 is True if any cell in a row has an empty string
m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'
out = df[~m1 & ~m2] # invert both condition and get the desired output
print(out)
Output:
Name Gender Age Address
1 Anna F 6 *address*
Used Input:
df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
'Gender': ['', 'F', 'M', 'M'],
'Age': [5, 6, 7, 7],
'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']}
)
using lambda fun
Code:
df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]
Output:
Name Gender Age Address
1 Anna F 6 *add
Hello guys I need your wisdom,
I'm still new to python and pandas and I'm looking to achieve the following thing.
df = pd.DataFrame({'code': [125, 265, 128,368,4682,12,26,12,36,46,1,2,1,3,6], 'parent': [12,26,12,36,46,1,2,1,3,6,'a','b','a','c','f'], 'name':['unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','g1','g2','g1','g3','g6']})
ds = pd.DataFrame({'code': [125, 265, 128,368,4682], 'name': ['Eagle','Cat','Koala','Panther','Dophin']})
I would like to add a new column in the ds dataframe with the name of the highest parent.
as an example for the first row :
code | name | category
125 | Eagle | a
"a" is the result of a loop between df.code and df.parent 125 > 12 > 1 > a
Since the last parent is not a number but a letter i think I must use a regex and than .merge from pandas to populate the ds['category'] column. Also maybe use an apply function but it seems a little bit above my current knowledge.
Could anyone help me with this?
Regards,
The following is certainly not the fastest solution but it works if your dataframes are not too big. First create a dictionary from the parent codes of df and then apply this dict recursively until you come to an end.
p = df[['code','parent']].set_index('code').to_dict()['parent']
def get_parent(code):
while par := p.get(code):
code = par
return code
ds['category'] = ds.code.apply(get_parent)
Result:
code name category
0 125 Eagle a
1 265 Cat b
2 128 Koala a
3 368 Panther c
4 4682 Dophin f
PS: get_parent uses an assignment expression (Python >= 3.8), for older versions of Python you could use:
def get_parent(code):
while True:
par = p.get(code)
if par:
code = par
else:
return code
This is my first time trying to use a lambda function, please help me determine what I'm doing incorrectly. I wrote a function to output time zones based on zip codes. The function works but not sure how to implement it as a lambdas function to create a new column in my dataframe
import pandas as pd
from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
def find_tz(zip_code):
try:
tz = zcdb[zip_code].timezone
return tz
except:
return '?'
data = [['Jane','92804'], ['Bob','75014'], ['Ashley','07650']]
df = pd.DataFrame(data, columns=['Contact','Zip'])
in: df
out:
Contact Zip
0 Jane 92804
1 Bob 75014
2 Ashley 07650
Do note that that zip code column data are strings, since US zip codes have leading 0s.
Me testing that the function I wrote works on values from df:
in: print(find_tz(df.loc[0,'Zip']))
print(find_tz(df.loc[1,'Zip']))
print(find_tz(df.loc[2,'Zip']))
out:
-8
-6
-5
My attempt at using a lambda function to create a new Timezone column, and the incorrect result I am getting:
in: df = df.assign(Timezone = lambda x: find_tz(x.Zip))
df
out:
Contact Zip Timezone
0 Jane 92804 ?
1 Bob 75014 ?
2 Ashley 07650 ?
My desired resulting dataframe would look like:
Contact Zip Timezone
0 Jane 92804 -8
1 Bob 75014 -6
2 Ashley 07650 -5
ETA: when I changed my find_tz() function to something like concatenating the input with another string of text, the lambda worked as I expected, so I'm not sure what I've done wrong.
You can use:
df['Timezone'] = df.Zip.apply(find_tz)
When you call lambda x: find_tz(x.Zip) the find_tz function is passed a Pandas series not the individual zip codes
I'm trying to modify the Address column data by removing all the characters before the comma.
Sample data:
**ADDRESS**
0 Ksfc Layout,Bangalore
1 Vishweshwara Nagar,Mysore
2 Jigani,Bangalore
3 Sector-1 Vaishali,Ghaziabad
4 New Town,Kolkata
Expected Output:
**ADDRESS**
0 Bangalore
1 Mysore
2 Bangalore
3 Ghaziabad
4 Kolkata
I tried this code but it's not working can someone correct the code?
import pandas as pd
import regex as re
data = pd.read_csv("train.csv")
data.ADDRESS.replace(re.sub(r'.*,',"", data.ADDRESS), regex=True, inplace=True)
Try this:
data.ADDRESS = data.ADDRESS.str.split(',').str[-1]
You can do it without a regex:
def removeFirst(x):
return x.split(",")[-1]
df['ADDRESS'] = df['ADDRESS'].apply(removeFirst)
You can try like this without Regex:
data['ADDRESS'] = data['ADDRESS'].str.split(',').str[-1]
Use Series.str.replace:
data['ADDRESS'] = data['ADDRESS'].str.replace(r'.*,', '')
See proof
I have a df,
ID Code Desc
1 146 ReadWriteThink offers hundreds of classroom
2 112AV67TYG language arts and literacy
3 3.75E+11 View the artwork for a U.S
4 3.22E+11 short passage on the details of this eloquent folktale
The code column has all types of values. I want to convert scientific notation to int in the code variable. I tried to convert that column to int but since it has 112AV67TYG, it says unable to parse string. Kindly help.
Expected output:
ID Code Desc
1 146 ReadWriteThink offers hundreds of classroom
2 112AV67TYG language arts and literacy
3 375072000000 View the artwork for a U.S
4 321881000000 short passage on the details of this eloquent
folktale
Try this:
df = pd.DataFrame({"code" : ['146','112AV67TYG','3.75E+11','3.22E+11']})
import ast
def try_convert(x):
try:
return int(ast.literal_eval(x))
except:
return x
df['new_code'] = df['code'].apply(lambda x : try_convert(x))
df
#######
code new_code
0 146 146
1 112AV67TYG 112AV67TYG
2 3.75E+11 375000000000
3 3.22E+11 322000000000
condition = df.Code.str.contains('E+', regex=False)
df.loc[condition, 'Code'] = df.loc[condition, 'Code'].astype(int)