I have an Excel file in which the data contains two $ signs per cell. When I read it with pandas, the strings come through in a special text style.
import pandas as pd
df = pd.DataFrame({ 'Bid-Ask':['$185.25 - $186.10','$10.85 - $11.10','$14.70 - $15.10']})
After pd.read_excel I tried:
df['Bid'] = df['Bid-Ask'].str.split('−').str[0]
The code above doesn't work: the $ turns my string into a special style of text, so the split function does nothing.
My expected result is two separate columns, Bid and Ask.
Do not split. Using str.extract is likely the most robust:
df[['Bid', 'Ask']] = df['Bid-Ask'].str.extract(r'(\d+(?:\.\d+)?)\D*(\d+(?:\.\d+)?)')
Output:
Bid-Ask Bid Ask
0 $185.25 - $186.10 185.25 186.10
1 $10.85 - $11.10 10.85 11.10
2 $14.70 - $15.10 14.70 15.10
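The extracted values are still strings; if numeric columns are needed downstream, a cast can follow the same extract (a sketch, rebuilding the sample frame):

```python
import pandas as pd

df = pd.DataFrame({'Bid-Ask': ['$185.25 - $186.10', '$10.85 - $11.10', '$14.70 - $15.10']})

# Pull the two numbers out of each range, then cast both captured groups to float
extracted = df['Bid-Ask'].str.extract(r'(\d+(?:\.\d+)?)\D*(\d+(?:\.\d+)?)').astype(float)
df[['Bid', 'Ask']] = extracted
```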
There is a non-breaking space (\xa0) in your string. That's why the split doesn't work.
I copied the strings (of your df) one by one into an Excel file and then imported it with pd.read_excel.
The column looks like this then:
repr(df['Bid-Ask'])
'0 $185.25\xa0- $186.10\n1 $10.85\xa0- $11.10\n2 $14.70\xa0- $15.10\nName: Bid-Ask, dtype: object'
Before splitting you can replace that and it'll work.
df['Bid-Ask'] = df['Bid-Ask'].astype('str').str.replace('\xa0',' ', regex=False)
df[['Bid', 'Ask']] = df['Bid-Ask'].str.replace('$', '', regex=False).str.split('-', expand=True)
print(df)
Bid-Ask Bid Ask
0 $185.25 - $186.10 185.25 186.10
1 $10.85 - $11.10 10.85 11.10
2 $14.70 - $15.10 14.70 15.10
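When a split mysteriously fails, repr() is the quickest way to spot a hidden character like the non-breaking space; a small self-contained check (a sketch, with the \xa0 written in explicitly):

```python
import pandas as pd

# A cell as it arrives from the Excel file, with the hidden non-breaking space (\xa0)
df = pd.DataFrame({'Bid-Ask': ['$185.25\xa0- $186.10']})

# repr() makes the invisible character show up in the console
cell = df.loc[0, 'Bid-Ask']
print(repr(cell))

# Listing code points pinpoints exactly which character breaks a naive split
codepoints = [hex(ord(c)) for c in cell]
```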
You can use apply with a lambda to split each column value in two and slice off the leading $:
df['Bid'] = df['Bid-Ask'].apply(lambda x: x.split('-')[0].strip()[1:])
df['Ask'] = df['Bid-Ask'].apply(lambda x: x.split('-')[1].strip()[1:])
output:
Bid-Ask Bid Ask
0 $185.25 - $186.10 185.25 186.10
1 $10.85 - $11.10 10.85 11.10
2 $14.70 - $15.10 14.70 15.10
I'd like to create a function which converts these objects to float. I tried to find a solution, but I always get errors:
# sample dataframe
d = {'price':['−$13.79', '−$ 13234.79', '$ 132834.79', 'R$ 75.900,00', 'R$ 69.375,12', '- $ 2344.92']}
df = pd.DataFrame(data=d)
I tried this code; at first I just wanted to find any working solution.
df['price'] = (df.price.str.replace("−$", "-").str.replace(r'\w+\$\s+', '').str.replace('.', '')\
.str.replace(',', '').astype(float)) / 100
So the idea was to convert −$ to - (for negative values), then replace the remaining $ with ''.
But as a result I get:
ValueError: could not convert string to float: '−$1379'
You can extract the digits on one side, detect whether there is a leading minus on the other, then combine:
import numpy as np

factor = np.where(df['price'].str.match(r'[−-]'), -1, 1) / 100
out = (pd.to_numeric(df['price'].str.replace(r'\D', '', regex=True), errors='coerce')
         .mul(factor))
output:
0 -13.79
1 -13234.79
2 132834.79
3 75900.00
4 69375.12
5 -2344.92
Name: price, dtype: float64
Can you use re? Like this:
import re

df['price'] = df['price'].apply(lambda x: float(re.sub(r'[^\-.0-9]', '', x.replace('−', '-')))) / 100
I'm just removing by regex all the symbols that are not 0-9, "." or "-", after first mapping the Unicode minus (U+2212) to a plain hyphen so float() can parse the sign.
BTW, no clue why you divide it by 100...
df["price2"] = pd.to_numeric(df["price"].str.replace("−", "-").str.replace(r"[R$\s.,]", "", regex=True)) / 100
df["price3"] = df["price"].str.replace("−", "-").str.replace(r"[R$\s.,]", "", regex=True).astype(float) / 100
df
A few notes:
An unescaped dot in a regex matches any character, so it has to be escaped or placed inside a character class.
The − symbol you are using is not a minus; it's a different character (U+2212).
I would use something like https://regex101.com for debugging.
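That "not actually a minus" note is easy to verify directly; a quick self-contained check (a sketch using one of the sample strings):

```python
# The "dash" in front of the prices is U+2212 (MINUS SIGN), not the ASCII hyphen-minus
s = '−$13.79'

# float() only understands the ASCII hyphen, so the Unicode sign must be mapped first
value = float(s.replace('−', '-').replace('$', ''))
```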
I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniformed YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The fill character in '{0:20>4}' must be a single character, so '20' is rejected
Trying to do something like the code below just results in df['id'] taking on the result of the last lambda, and every other way I tried to combine multiple apply/lambda calls didn't work either. I started going down the pad-left/pad-right route, but that seemed to be taking me backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but HATE, because it's long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are:
Is there a way to run multiple apply() functions in a single line so I'm not making temp variables?
Is there a better way than replacing 'XX' with '20'?
I feel like this entire code block can be compressed to one or two lines; I just don't know how. Everything I've seen on SO and in the pandas documentation relates to manipulating a single thing at a time.
One option is to split, then use str.zfill to pad with '0'. Also prepend '20' before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-'+ tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
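radd is just the reflected +, i.e. it computes '20' + value for each element; the whole answer runs self-contained like this (a sketch on a two-row sample):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['07-1401469', '07-381962']})

# radd('20') prepends '20' elementwise, turning '07-…' into '2007-…';
# zfill(8) then left-pads the numeric part with zeros
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-' + tmp.str[1].str.zfill(8)
```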
I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
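Note that pandas str.replace accepts a callable replacement only when regex=True; the two passes run self-contained like this (a sketch on a two-row sample):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['07-1401469', '07-381962']})

# Pass 1: prepend '20' to the two leading digits
df['ID'] = df['ID'].str.replace(r'^(\d{2})-', r'20\1-', regex=True)
# Pass 2: zero-pad whatever follows the dash to 8 digits via a callable replacement
df['ID'] = df['ID'].str.replace(r'-(\d+)', lambda g: f"-{g[1]:0>8}", regex=True)
```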
I am using Pandas and Python. My data is:
a=pd.DataFrame({'ID':[1,2,3,4,5],
'Str':['aa <aafae><afre> ht4',
'v fef <><433>',
'<1234334> <a>',
'<bijf> 04<9tu0>q4g <vie>',
'aaa 1']})
I want to extract all the substrings between < > and merge them with a blank space. For example, the above example should give the result:
aafae afre
433
1234334 a
bijf 9tu0 vie
nan
So all the substrings between < > are extracted, and the result is nan if there are no such strings. I have already tried the re library and the str functions, but I am really new to regex. Could anyone help me out here?
Use pandas.Series.str.findall:
a['Str'].str.findall('<(.*?)>').apply(' '.join)
Output:
0 aafae afre
1 433
2 1234334 a
3 bijf 9tu0 vie
4
Name: Str, dtype: object
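If a real missing value is wanted where nothing matched (as in the question's expected output), the empty join result can be swapped afterwards (a sketch on a two-row sample):

```python
import pandas as pd

a = pd.DataFrame({'Str': ['aa <aafae><afre> ht4', 'aaa 1']})

# findall gives a list per row; ' '.join merges it, and rows with no <...>
# produce an empty string, which is then replaced by a missing value
merged = a['Str'].str.findall('<(.*?)>').apply(' '.join).replace('', pd.NA)
```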
Maybe this expression might work, at least to some extent:
import pandas as pd
a=pd.DataFrame({'ID':[1,2,3,4,5],
'Str':['aa <aafae><afre> ht4',
'v fef <><433>',
'<1234334> <a>',
'<bijf> 04<9tu0>q4g <vie>',
'aaa 1']})
a["new_str"]=a["Str"].str.replace(r'.*?<([^>]+)>|(?:.+)', r'\1 ',regex=True)
print(a)
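The alternation relies on the fallback branch (?:.+) swallowing any text outside brackets; since Python 3.5, an unmatched group in the replacement substitutes as an empty string, which is what clears those rows. A quick self-contained check (a sketch on a two-row sample):

```python
import pandas as pd

a = pd.DataFrame({'Str': ['aa <aafae><afre> ht4', 'aaa 1']})

# Each <...> chunk becomes its contents plus a space; leftover text hits the
# fallback branch, whose unmatched group \1 substitutes as an empty string
a['new_str'] = a['Str'].str.replace(r'.*?<([^>]+)>|(?:.+)', r'\1 ', regex=True)
```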
I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column name 'about' which contains the rows of data I want to manipulate.
I've already used the .str accessor to update the values, but the change is not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations, since they work on the same About column; and because the values are converted to lowercase first, the regex can be changed to replace anything that is not lowercase:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print (df)
About
0 aasd
1 sdd aa
import pandas as pd
import numpy as np
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
About
0 alpha
1 omega
2 alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution: another way, using a compiled regex (the pattern was missing from the snippet; reconstructed here from the explanation below, which describes stripping trailing non-letters):
>>> import re
>>> regex_pat = re.compile(r'[^a-z]+$')
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z]+ matches one or more consecutive characters not in the range a to z (case sensitive); the + quantifier is greedy, matching as many characters as possible.
$ asserts position at the end of the line, so only trailing non-letters are removed.
I have a DataFrame with some user input (it's supposed to just be a plain email address), along with some other values, like this:
import pandas as pd
from pandas import Series, DataFrame
df = pd.DataFrame({'input': ['Captain Jean-Luc Picard <picard#starfleet.com>',
                             'deanna.troi#starfleet.com',
                             'data#starfleet.com',
                             'William Riker <riker#starfleet.com>'],
                   'val_1': [1.5, 3.6, 2.4, 2.9],
                   'val_2': [7.3, -2.5, 3.4, 1.5]})
Due to a bug, the input sometimes has the user's name as well as brackets around the email address; this needs to be fixed before continuing with the analysis.
To move forward, I want to create a new column that has cleaned versions of the emails: if the email contains names/brackets then remove those, else just give the already correct email.
There are numerous examples of cleaning string data with Python/pandas, but I've yet to successfully implement any of those suggestions. Here are a few examples of what I've tried:
# as noted in pandas docs, turns all non-matching strings into NaN
df['cleaned'] = df['input'].str.extract('<(.*)>')
# AttributeError: type object 'str' has no attribute 'contains'
df['cleaned'] = df['input'].apply(lambda x: str.extract('<(.*)>') if str.contains('<(.*)>') else x)
# AttributeError: 'DataFrame' object has no attribute 'str'
df['cleaned'] = df[df['input'].str.contains('<(.*)>')].str.extract('<(.*)>')
Thanks!
Use np.where to apply str.extract to the rows that contain the embedded email; for the else condition just return the 'input' value:
In [63]:
df['cleaned'] = np.where(df['input'].str.contains('<'), df['input'].str.extract('<(.*)>', expand=False), df['input'])
df
Out[63]:
input val_1 val_2 \
0 Captain Jean-Luc Picard <picard#starfleet.com> 1.5 7.3
1 deanna.troi#starfleet.com 3.6 -2.5
2 data#starfleet.com 2.4 3.4
3 William Riker <riker#starfleet.com> 2.9 1.5
cleaned
0 picard#starfleet.com
1 deanna.troi#starfleet.com
2 data#starfleet.com
3 riker#starfleet.com
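The same result also falls out of a single str.extract with a fillna fallback, since extract yields NaN for rows without brackets (a sketch on a two-row sample):

```python
import pandas as pd

df = pd.DataFrame({'input': ['Captain Jean-Luc Picard <picard#starfleet.com>',
                             'deanna.troi#starfleet.com']})

# extract(expand=False) returns a Series: the bracketed address where present,
# NaN otherwise; fillna then falls back to the original value for clean rows
df['cleaned'] = df['input'].str.extract(r'<(.*)>', expand=False).fillna(df['input'])
```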
If you want to use regular expressions:
import re
rex = re.compile(r'<(.*)>')
def fix(s):
    m = rex.search(s)
    if m is None:
        return s
    else:
        return m.groups()[0]
fixed = df['input'].apply(fix)