I have the following dataframe:
contract
0 WTX1518X22
1 WTX1518X20.5
2 WTX1518X19
3 WTX1518X15.5
I need to add a new column containing everything following the last 'X' from the first column. So the result would be:
contract result
0 WTX1518X22 22
1 WTX1518X20.5 20.5
2 WTX1518X19 19
3 WTX1518X15.5 15.5
So I figure I first need to find the string index position of the last 'X' (because there may be more than one 'X' in the string). Then get a substring containing everything following that index position for each row.
EDIT:
I have managed to get the index position of 'X' as required:
df['index_pos'] = df['contract'].str.rfind('X', start=0, end=None)
But I still can't seem to get a new column containing all characters following the 'X'. I am trying:
df['index_pos'] = df['index_pos'].convert_objects(convert_numeric=True)
df['result'] = df['contract'].str[df['index_pos']:]
But this just gives me an empty column called 'result'. This is strange because if I do the following then it works correctly:
df['result'] = df['contract'].str[8:]
So I just need a way to not hardcode '8' but to instead use the column 'index_pos'. Any suggestions?
Use vectorised str.split to split the string and cast the last split to float:
In [10]:
df['result'] = df['contract'].str.split('X').str[-1].astype(float)
df
Out[10]:
contract result
0 WTX1518X22 22.0
1 WTX1518X20.5 20.5
2 WTX1518X19 19.0
3 WTX1518X15.5 15.5
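As an aside, the index_pos route from the question can also be made to work: Series.str slicing does not broadcast a Series of per-row positions, which is why df['contract'].str[df['index_pos']:] came back empty. A row-wise pass does the trick (a minimal sketch, assuming the index_pos column from the question):
# Slice each string from its own position; +1 skips the 'X' itself
df['result'] = [s[i + 1:] for s, i in zip(df['contract'], df['index_pos'])]
df['result'] = df['result'].astype(float)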
import pandas as pd
import re
df['result'] = df['contract'].map(lambda x: float(re.findall(r'([0-9.]+)$', x)[0]))
Out[34]:
contract result
0 WTX1518X22 22.0
1 WTX1518X20.5 20.5
2 WTX1518X19 19.0
3 WTX1518X15.5 15.5
A similar approach to EdChum's, but using regular expressions; this one only assumes that the number is at the end of the string.
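The same idea can also be written without map, using the vectorised str.extract (a sketch; rows without a trailing number would come out as NaN rather than raising):
# Extract the trailing run of digits/dots and cast to float
df['result'] = df['contract'].str.extract(r'([\d.]+)$', expand=False).astype(float)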
I have a pandas dataframe with different formats for one column, like this:
Name    Values
First   5-9
Second  7
Third   -
Fourth  12-16
I need to iterate over the Values column: if the format is like the first row (5-9) or the fourth row (12-16), replace it with the mean of the two numbers in the string. So the first row's 5-9 becomes 7, and the fourth row's 12-16 becomes 14. If the format is like the third row (-), replace it with 0.
I have tried:
if df["Value"].str.len() > 1:
    df["Value"] = df["Value"].str.split('-')
    df["Value"] = (df["Value"][0] + df["Value"][1]) / 2
elif df["Value"].str.len() == 1:
    df["Value"] = df["Value"].str.replace('-', 0)
Expected output:
Name    Values
First   7
Second  7
Third   0
Fourth  14
Let us split and expand the column, then cast the values to float and take the mean along the column axis:
s = df['Values'].str.split('-', expand=True)
df['Values'] = s[s != ''].astype(float).mean(1).fillna(0)
Name Values
0 First 7.0
1 Second 7.0
2 Third 0.0
3 Fourth 14.0
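For clarity, here is what each row of the intermediate s contains and how it feeds the mean (an annotated restatement of the same code):
s = df['Values'].str.split('-', expand=True)
# row 0: ['5', '9']   -> mean(5.0, 9.0)                            -> 7.0
# row 1: ['7', None]  -> None becomes NaN; mean skips it           -> 7.0
# row 2: ['', '']     -> masked to NaN by s[s != '']; mean is NaN  -> fillna(0) -> 0.0
# row 3: ['12', '16'] -> mean(12.0, 16.0)                          -> 14.0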
You can use str.replace with a customized replacement function:
mint = lambda s: int(s or 0)
repl = lambda m: str(sum(map(mint, map(m.group, [1, 2]))) / 2)
df['Values'] = df['Values'].str.replace(r'(\d*)-(\d*)', repl, regex=True)
print(df)
Name Values
0 First 7.0
1 Second 7
2 Third 0.0
3 Fourth 14.0
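Note that str.replace returns strings (and leaves the unmatched '7' untouched), hence the mixed-looking column above. A final cast would make the dtype uniform (a small follow-up, not part of the original answer):
# Parse the mixed string results ('7.0', '7', '0.0', '14.0') into floats
df['Values'] = pd.to_numeric(df['Values'])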
I have a data frame with repeating string values. I want to reorder it into a desired order.
My code:
df =
name
0 Fix
1 1Ax
2 2Ax
3 2Ax
4 1Ax
5 Fix
df = df.sort_values(by=['name'], ignore_index=True, ascending=False)
print(df)
df =
name
0 Fix
1 Fix
2 2Ax
3 2Ax
4 1Ax
5 1Ax
Expected answer:
df =
name
0 Fix
1 Fix
2 1Ax
3 1Ax
4 2Ax
5 2Ax
Currently you are sorting in reverse alphabetical order: so 'F' comes before '2' which comes before '1'. Changing ascending to True will place 'Fix' at the bottom.
It's a bit of a hack, but you could pull out the rows where the first character is a number and sort them separately...
import pandas as pd
df = pd.DataFrame(['Fix', '1Ax','2Ax','2Ax','1Ax','Fix'], columns=['name'])
# Sort alphabetically
df = df.sort_values(by=['name'],ignore_index=True,ascending=True)
# Get first character of string
first_digit = df['name'].str[0]
# Get cases where first character is (not) a number
starts_with_digits = df[first_digit.str.isdigit()]
not_starts_with_digits = df[~first_digit.str.isdigit()]
# Put rows whose first character is not a number first, then those that start with one
pd.concat([not_starts_with_digits, starts_with_digits]).reset_index(drop=True)
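Alternatively, a helper column achieves the same ordering in a single sort, avoiding the concat (a sketch of the same idea):
import pandas as pd

df = pd.DataFrame(['Fix', '1Ax', '2Ax', '2Ax', '1Ax', 'Fix'], columns=['name'])
# Sort non-digit-leading names first (False < True), then alphabetically within each group
df = (df.assign(starts_digit=df['name'].str[0].str.isdigit())
        .sort_values(['starts_digit', 'name'])
        .drop(columns='starts_digit')
        .reset_index(drop=True))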
I have a data frame where I would like to merge the content of two rows, separated by an underscore, within the same cell.
If this is the original DF:
0 eye-right eye-right hand
1 location location position
2 12 27.7 2
3 14 27.6 2.2
I would like it to become:
0 eye-right_location eye-right_location hand_position
1 12 27.7 2
2 14 27.6 2.2
Eventually I would like to promote row 0 to the header and reset the index for the entire df.
You can set your column labels, slice via iloc, then reset_index:
print(df)
# 0 1 2
# 0 eye-right eye-right hand
# 1 location location position
# 2 12 27.7 2
# 3 14 27.6 2.2
df.columns = (df.iloc[0] + '_' + df.iloc[1])
df = df.iloc[2:].reset_index(drop=True)
print(df)
# eye-right_location eye-right_location hand_position
# 0 12 27.7 2
# 1 14 27.6 2.2
I like jpp's answer a lot. Short and sweet. Perfect for quick analysis.
Just one quibble: The resulting DataFrame is generically typed. Because strings were in the first two rows, all columns are considered type object. You can see this with the info method.
For data analysis, it's often preferable that columns have specific numeric types. This can be tidied up with one more line:
df.columns = df.iloc[0] + '_' + df.iloc[1]
df = df.iloc[2:].reset_index(drop=True)
df = df.apply(pd.to_numeric)
The third line applies pandas' to_numeric function to each column in turn, leaving a more specifically typed DataFrame.
While not essential for simple usage, as soon as you start performing math on DataFrames, or start using very large data sets, column types become something you'll need to pay attention to.
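A quick check confirms the cast (the dtypes shown are what to_numeric would infer for the example data above; exact types may vary with your data):
print(df.dtypes)
# eye-right_location      int64
# eye-right_location    float64
# hand_position         float64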
I want to replace the entire cell that contains the word 'Dividend' (circled in the picture) with blanks or NaN. However, when I try to replace, for example, '1.25 Dividend', it turns out as '1.25 NaN'. I want the whole cell to become 'NaN'. Any idea how to make this work?
Option 1
Use a regular expression in your replace
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
(Using regex=True) means that it will interpret the problem as a regular expression one. You still need an appropriate pattern. The '^' says to start at the beginning of the string. '^.*' matches all characters from the beginning of the string. '$' says to end the match with the end of the string. '.*$' matches all characters up to the end of the string. Finally, '^.*Dividend.*$' matches all characters from the beginning, has 'Dividend' somewhere in the middle, then any characters after it. Then replace this whole thing with np.nan
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap.
Pass a lambda to applymap that identifies whether a cell has 'Dividend' in it:
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
Replace all strings (anything non-numeric is coerced to NaN):
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
I would use applymap like this:
df.applymap(lambda x: np.nan if (isinstance(x, str) and 'Dividend' in x) else x)
I have a dataframe with 71 columns and 30597 rows. I want to replace all non-NaN entries with 1 and the NaN values with 0.
Initially I tried a for-loop over each value of the dataframe, but it took too much time.
Then I used data_new = data.subtract(data), which was meant to subtract the dataframe from itself so that all the non-null values would become 0. But an error occurred because the dataframe had multiple string entries.
You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise, and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:
newdf = df.notnull().astype('int')
If you really want to write into your original DataFrame, this will work:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
Use notnull and cast the boolean result to int with astype:
print ((df.notnull()).astype('int'))
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print (df)
a b
0 NaN 1.0
1 4.0 NaN
2 NaN 3.0
print (df.notnull())
a b
0 False True
1 True False
2 False True
print ((df.notnull()).astype('int'))
a b
0 0 1
1 1 0
2 0 1
I'd advise making a new column rather than just replacing. You can always delete the previous column if necessary, but it's always helpful to have a source for a column populated via an operation on another.
e.g. if df['col1'] is the existing column
df['col2'] = df['col1'].apply(lambda x: 1 if not pd.isnull(x) else 0)
where col2 is the new column. This should also work if col1 has string entries.
I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.
I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is approximately five times slower. Just an FYI for anyone doing larger-scale replacements.
import numpy as np
import pandas as pd
import datetime as dt

# create dataframe with randomly placed NaN's
data = np.ones((100, 100))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)

trials = np.arange(100)

d1 = dt.datetime.now()
for r in trials:
    new_df = df.notnull().astype(int)
print((dt.datetime.now() - d1).total_seconds() / trials.size)

# create a dummy copy of df. I use a dummy copy here to prevent biasing the
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()

d1 = dt.datetime.now()
for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull() == False] = 1
print((dt.datetime.now() - d1).total_seconds() / trials.size)
This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.
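For what it's worth, the standard timeit module gives more robust numbers than hand-rolled datetime deltas (a sketch for the first approach only; the indexing variant mutates df_dummy in place, so it would need a fresh copy per run):
import timeit

# average seconds per call of the notnull approach over 100 runs
print(timeit.timeit(lambda: df.notnull().astype(int), number=100) / 100)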
There is a method .fillna() on DataFrames which does what you need. For example:
df = df.fillna(0) # Replace all NaN values with zero, returning the modified DataFrame
or
df.fillna(0, inplace=True) # Replace all NaN values with zero, updating the DataFrame directly
Regarding fmarc's answer:
df.loc[~df.isnull()] = 1  # not nan
df.loc[df.isnull()] = 0  # nan
The code above does not work for me with pandas 0.25.3; the following does:
df[~df.isnull()] = 1  # not nan
df[df.isnull()] = 0  # nan
And if you want to change values in specific columns only, you may need to create a temporary dataframe and assign it back to those columns of the original dataframe:
change_col = ['a', 'b']
tmp = df[change_col]
tmp[tmp.isnull()] = 'xxx'
df[change_col] = tmp
Try this one:
df.notnull().mul(1)
Here is a suggestion for a single column: replace NaN with 0, and everything that is not NaN with 1.
The line below fills the NaN's in your column with 0:
df.YourColumnName.fillna(0, inplace=True)
Now the rest (the non-NaN part) is replaced with 1:
df["YourColumnName"] = df["YourColumnName"].apply(lambda x: 1 if x != 0 else 0)
The same can be applied to the whole dataframe by not specifying a column name. Note this assumes the column contains no legitimate 0 values, which would otherwise also end up as 0.
Use df.fillna(0) to fill NaN with 0.
Generally there are two steps: substitute all non-NaN values, then substitute all NaN values.
dataframe.where(~dataframe.notna(), 1) - this line replaces all non-NaN values with 1.
dataframe.fillna(0) - this line replaces all NaNs with 0.
Side note: if you take a look at the pandas documentation, .where replaces all values where the condition is False - this is the important thing. That is why we use the inversion ~dataframe.notna() to create the mask by which .where() replaces values.
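Chained together and assigned back, the two steps read (a minimal sketch):
# keep cells where the condition is True (the NaNs), replace the rest with 1,
# then fill the remaining NaNs with 0
dataframe = dataframe.where(dataframe.isna(), 1).fillna(0)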