Hello, I have the following example of a df:
col1 col2
12.4 12.32
11.4- 2.3
2.0- 1.1
I need the negative sign to be at the beginning of the number, not at the end:
col1 col2
12.4 12.32
-11.4 2.3
-2.0 1.1
I am trying with the following function. So far I can get the data with the sign and print it correctly, but I don't know how to write the result back into my column:
updated_data = ''  # iterate over the content
for line in df["col1"]:
    # removing last word
    updated_line = ' '.join(str(line).split('-')[:-1])
    print(updated_line)
Could you help me, please? Or if there is an easier way to do it, I would appreciate it.
Here is one way to do it, using np.where:
import numpy as np

# check whether the string ends with '-'; if so, remove the '-', convert to float, and multiply by -1
df['col1'] = np.where(df['col1'].str.strip().str.endswith('-'),
                      df['col1'].str.replace(r'-', '', regex=True).astype('float') * -1,
                      df['col1'])
df
col1 col2
0 12.4 12.32
1 -11.4 2.30
2 -2.0 1.10
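If you prefer a single pass without np.where, a trailing '-' can also be moved to the front with one regex before converting to float. A minimal sketch, with the sample data reconstructed from the question:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['12.4', '11.4-', '2.0-'],
                   'col2': [12.32, 2.3, 1.1]})

# move a trailing '-' to the front of the number, then convert to float;
# values without a trailing '-' are left unchanged by the regex
df['col1'] = df['col1'].str.replace(r'^(.*)-$', r'-\1', regex=True).astype(float)
print(df)
```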
I am trying to split and parse characters from a column and put the parsed data into a different column. I was parsing on '_' in the column data; it worked well as long as the number of '_' characters in each string was fixed at 2.
Input Data:
Col1
U_a65839_Jan87Apr88
U_b98652_Feb88Apr88_(2).jpg.pdf
V_C56478_mar89Apr89
Q_d15634_Apr90Apr91
Q_d15634_Apr90Apr91_(3).jpeg.pdf
S_e15336_may91Apr93
NaN
Expected Output:
col2
Jan87Apr88
Feb88Apr88
mar89Apr89
Apr90Apr91
Apr90Apr91
may91Apr93
Code I have been trying:
df = pd.read_excel(open(r'Dats.xlsx', 'rb'), sheet_name='Sheet1')
df['Col2'] = df.Col1.str.replace('.*_', '', regex=True)
print(df['Col2'])
I think you want this:
col2 = df.Col1.str.split("_", expand=True)[2]
output:
0 Jan87Apr88
1 Feb88Apr88
2 mar89Apr89
3 Apr90Apr91
4 Apr90Apr91
5 may91Apr93
6 NaN
(you can dropna if you don't want the last row)
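For reference, a self-contained sketch of this split approach, with the sample data copied from the question:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['U_a65839_Jan87Apr88',
                            'U_b98652_Feb88Apr88_(2).jpg.pdf',
                            'V_C56478_mar89Apr89',
                            'Q_d15634_Apr90Apr91',
                            'Q_d15634_Apr90Apr91_(3).jpeg.pdf',
                            'S_e15336_may91Apr93',
                            None]})

# split on '_' and keep the third part; extra trailing parts are ignored,
# and the None row becomes NaN
col2 = df.Col1.str.split('_', expand=True)[2]
print(col2)
```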
Use str.extract here:
import re
df["col2"] = df["Col1"].str.extract(r'((?:[a-z]{3}\d{2}){2})', flags=re.IGNORECASE)
Based on your question, pandas' DataFrame.apply can be a good solution.
First, clean the DataFrame by replacing NaNs with the empty string '':
df = pd.DataFrame(data=['U_a65839_Jan87Apr88', 'U_b98652_Feb88Apr88_(2).jpg.pdf', 'V_C56478_mar89Apr89', 'Q_d15634_Apr90Apr91', 'Q_d15634_Apr90Apr91_(3).jpeg.pdf', 'S_e15336_may91Apr93', None], columns=['Col1'])
df = df.fillna('')
Col1
0 U_a65839_Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf
2 V_C56478_mar89Apr89
3 Q_d15634_Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf
5 S_e15336_may91Apr93
6
Next, define a function to extract the required string with a regex:
import re

def fun(s):
    m = re.search(r'\w{3}\d{2}\w{3}\d{2}', s)
    if m:
        return m.group(0)
    else:
        return ''
Then apply the function to the DataFrame:
df['Col2'] = df['Col1'].apply(fun)
Col1 Col2
0 U_a65839_Jan87Apr88 Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf Feb88Apr88
2 V_C56478_mar89Apr89 mar89Apr89
3 Q_d15634_Apr90Apr91 Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf Apr90Apr91
5 S_e15336_may91Apr93 may91Apr93
6
Hope the above helps.
I am trying to remove alphabetic characters and the special character ',' from the column values. When I try to remove the alphabetic characters, the output is NaN.
Input Data:
col2
2565.0
23899
876.44
1765.7
3,253.0CA
9876.9B
Output Data:
col2
2565.0
23899
876.44
1765.7
3253.0
9876.9
Code I have been using:
df['col2'] = df['col2'].str.replace(r"[a-zA-Z]",'')
df['col2']=df['col2'].fillna('').str.replace(',',"").astype(float)
Please suggest how to resolve this.
Use Series.replace with a regex that matches anything that is not a digit or a dot:
df['col2'] = df.col2.replace(r'[^\d.]', '', regex=True).astype(float)
Output
col2
0 2565.00
1 23899.00
2 876.44
3 1765.70
4 3253.00
5 9876.90
Use Series.str.replace:
df['col2'] = df['col2'].str.replace(r'[a-zA-Z,]','', regex=True).astype(float)
print (df)
col2
0 2565.00
1 23899.00
2 876.44
3 1765.70
4 3253.00
5 9876.90
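A variation on the same idea (a sketch of my own, not from the answers above): strip everything that is not a digit or a dot, then let pd.to_numeric do the conversion:

```python
import pandas as pd

s = pd.Series(['2565.0', '23899', '876.44', '1765.7', '3,253.0CA', '9876.9B'])

# keep only digits and dots, then convert; the result is float64
# because some values carry a decimal point
cleaned = pd.to_numeric(s.str.replace(r'[^\d.]', '', regex=True))
print(cleaned)
```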
I am loading a .csv with pandas (pd.read_csv). Normally this yields floats; however, a few of my datasets have a 'q' inside some of the > 100000 numbers (for instance a 33x60000 matrix) in the .csv file, like this: '-13q27.20148186934421000000' (the q's are NOT always in the same place). This causes pandas to treat them as strings rather than numbers, which makes a conversion to float impossible. Hence my question: how can I easily find the 'q's and remove them?
I tried using a for loop and checking whether each individual string contains a 'q'; this, however, takes ages:
for i in range(tmp.values.shape[0]):
    for j in range(tmp.values.shape[1]):
        if 'q' in tmp.values[i, j]:
            print('oh oh')
It is also possible that it is sometimes a letter other than 'q', so it might be wise to look for letters in general; I have no idea how to do this efficiently.
Thanks in advance for your help!
Use pandas.DataFrame.replace with regex=True:
Given df:
col1 col2 col3
0 1.1 2.2 3.3
1 2q.2 3.q4 q5.3
2 4.4 5.5 6.6
df = df.replace('q', '', regex=True).astype(float)
print(df.dtypes)
print(df)
Output:
col1 float64
col2 float64
col3 float64
dtype: object
col1 col2 col3
0 1.1 2.2 3.3
1 2.2 3.4 5.3
2 4.4 5.5 6.6
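Since the stray character is not always a 'q', the same replace can target any ASCII letter. A sketch using a small frame modeled on the one above, with a few different letters mixed in:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['1.1', '2q.2', '4.4'],
                   'col2': ['2.2', '3.z4', '5.5'],
                   'col3': ['3.3', 'a5.3', '6.6']})

# strip ANY ASCII letter, not just 'q', then convert everything to float
df = df.replace('[A-Za-z]', '', regex=True).astype(float)
print(df.dtypes)
```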
You can strip a given character (here 'q') from the start and end of the values in a specific column (here named result):
data['result'] = data['result'].map(lambda x: x.lstrip('q').rstrip('q'))
Afterwards you can convert the column to float:
data['result'] = data['result'].astype(float)
Or alternatively, to remove any letter anywhere in the string:
df['result'] = df['result'].str.replace(r'[A-Za-z]', '', regex=True).astype(float)
df.replace(['q'], 0.0, inplace=True)
I have a data frame where I would like to merge the content of two rows, separated by an underscore, into a single cell.
If this is the original DF:
0 eye-right eye-right hand
1 location location position
2 12 27.7 2
3 14 27.6 2.2
I would like it to become:
0 eye-right_location eye-right_location hand_position
1 12 27.7 2
2 14 27.6 2.2
Eventually I would like row 0 to become the header, and to reset the indexes for the entire df.
You can set your column labels, slice via iloc, then reset_index:
print(df)
# 0 1 2
# 0 eye-right eye-right hand
# 1 location location position
# 2 12 27.7 2
# 3 14 27.6 2.2
df.columns = (df.iloc[0] + '_' + df.iloc[1])
df = df.iloc[2:].reset_index(drop=True)
print(df)
# eye-right_location eye-right_location hand_position
# 0 12 27.7 2
# 1 14 27.6 2.2
I like jpp's answer a lot. Short and sweet. Perfect for quick analysis.
Just one quibble: The resulting DataFrame is generically typed. Because strings were in the first two rows, all columns are considered type object. You can see this with the info method.
For data analysis, it's often preferable that columns have specific numeric types. This can be tidied up with one more line:
df.columns = df.iloc[0] + '_' + df.iloc[1]
df = df.iloc[2:].reset_index(drop=True)
df = df.apply(pd.to_numeric)
The third line here applies pandas' to_numeric function to each column in turn, leaving a properly typed DataFrame.
While not essential for simple usage, as soon as you start performing math on DataFrames, or start using very large data sets, column types become something you'll need to pay attention to.
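Putting the three lines together, a runnable sketch with the sample frame reconstructed from the question:

```python
import pandas as pd

df = pd.DataFrame([['eye-right', 'eye-right', 'hand'],
                   ['location', 'location', 'position'],
                   ['12', '27.7', '2'],
                   ['14', '27.6', '2.2']])

# build the header from rows 0 and 1, drop those rows, then fix the dtypes
df.columns = df.iloc[0] + '_' + df.iloc[1]
df = df.iloc[2:].reset_index(drop=True)
df = df.apply(pd.to_numeric)
print(df.dtypes)
```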
I have the following dataframe:
contract
0 WTX1518X22
1 WTX1518X20.5
2 WTX1518X19
3 WTX1518X15.5
I need to add a new column containing everything following the last 'X' from the first column. So the result would be:
contract result
0 WTX1518X22 22
1 WTX1518X20.5 20.5
2 WTX1518X19 19
3 WTX1518X15.5 15.5
So I figure I first need to find the string index position of the last 'X' (because there may be more than one 'X' in the string). Then get a substring containing everything following that index position for each row.
EDIT:
I have managed to get the index position of 'X' as required:
df['index_pos'] = df['contract'].str.rfind('X', start=0, end=None)
But I still can't seem to get a new column containing all characters following the 'X'. I am trying:
df['index_pos'] = df['index_pos'].convert_objects(convert_numeric=True)
df['result'] = df['contract'].str[df['index_pos']:]
But this just gives me an empty column called 'result'. This is strange because if I do the following then it works correctly:
df['result'] = df['contract'].str[8:]
So I just need a way to not hardcode '8' but to instead use the column 'index_pos'. Any suggestions?
Use vectorised str.split to split the string and cast the last split to float:
In [10]:
df['result'] = df['contract'].str.split('X').str[-1].astype(float)
df
Out[10]:
contract result
0 WTX1518X22 22.0
1 WTX1518X20.5 20.5
2 WTX1518X19 19.0
3 WTX1518X15.5 15.5
import pandas as pd
import re

df['result'] = df['contract'].map(lambda x: float(re.findall(r'([0-9.]+)$', x)[0]))
Out[34]:
contract result
0 WTX1518X22 22.0
1 WTX1518X20.5 20.5
2 WTX1518X19 19.0
3 WTX1518X15.5 15.5
A similar approach to the one by EdChump using regular expressions, this one only assumes that the number is at the end of the string.
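If you want literally "everything after the last X" rather than relying on the value being numeric, a regex anchored at the end of the string also works. A sketch of my own, with the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'contract': ['WTX1518X22', 'WTX1518X20.5',
                                'WTX1518X19', 'WTX1518X15.5']})

# capture the run after the last 'X' (a run containing no further 'X')
# and cast it to float; expand=False returns a Series for the single group
df['result'] = df['contract'].str.extract(r'X([^X]*)$', expand=False).astype(float)
print(df)
```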