Does anyone know how I'd format this string (which is a column in a dataframe) to be a float so I can sort by the column please?
£880,000
£88,500
£850,000
£845,000
i.e. I want this to become
88,500
845,000
850,000
880,000
Thanks in advance!
Assuming 'col' is the column name.
If you just want to sort and keep the values as strings, you can use natsort:
from natsort import natsort_key
df.sort_values(by='col', key=natsort_key)
# OR
from natsort import natsort_keygen
df.sort_values(by='col', key=natsort_keygen())
output:
col
1 £88,500
3 £845,000
2 £850,000
0 £880,000
If you want to convert to floats:
df['col'] = pd.to_numeric(df['col'].str.replace(r'[^\d.]', '', regex=True))
df.sort_values(by='col')
output:
col
1 88500
3 845000
2 850000
0 880000
If you want strings, you can use str.lstrip:
df['col'] = df['col'].str.lstrip('£')
output:
col
0 880,000
1 88,500
2 850,000
3 845,000
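If you keep the '£'-stripped strings but still want them ordered by value, sort_values also accepts a key callable (pandas 1.1+). A minimal sketch, assuming the sample data from the question:
import pandas as pd

df = pd.DataFrame({'col': ['£880,000', '£88,500', '£850,000', '£845,000']})
df['col'] = df['col'].str.lstrip('£')

# strip the thousands separators only for the comparison, keep them in the data
df_sorted = df.sort_values(
    by='col',
    key=lambda s: pd.to_numeric(s.str.replace(',', '', regex=False))
)
print(df_sorted)
#        col
# 1   88,500
# 3  845,000
# 2  850,000
# 0  880,000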
I have some columns that have the same names. I would like to add a 1 to the repeating column names
Data
Date Type hi hello stat hi hello
1/1/2022 a 0 0 1 1 0
Desired
Date Type hi hello stat hi1 hello1
1/1/2022 a 0 0 1 1 0
Doing
mask = df['col2'].duplicated(keep=False)
I believe I can use a mask, but I'm not sure how to achieve this efficiently without naming the actual column. I would like to run it over the full dataset and let it update the duplicates.
Any suggestion is appreciated
Use the built-in parser method _maybe_dedup_names():
df.columns = pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(df.columns)
# Date Type hi hello stat hi.1 hello.1
# 0 1/1/2022 a 0 0 1 1 0
This is what pandas uses to deduplicate column headers from read_csv().
Note that it scales to any number of duplicate names:
cols = ['hi'] * 3 + ['hello'] * 5
pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(cols)
# ['hi', 'hi.1', 'hi.2', 'hello', 'hello.1', 'hello.2', 'hello.3', 'hello.4']
In pandas < 1.3:
df.columns = pd.io.parsers.ParserBase({})._maybe_dedup_names(df.columns)
You need to apply duplicated to the column names, then map the duplication flags to strings that you append to the original column names.
df.columns = df.columns+[{False:'',True:'1'}[x] for x in df.columns.duplicated()]
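For example, on a frame shaped like the one in the question (a quick sketch; the sample values are assumed):
import pandas as pd

df = pd.DataFrame([['1/1/2022', 'a', 0, 0, 1, 1, 0]],
                  columns=['Date', 'Type', 'hi', 'hello', 'stat', 'hi', 'hello'])
df.columns = df.columns + [{False: '', True: '1'}[x] for x in df.columns.duplicated()]
print(df.columns.tolist())
# ['Date', 'Type', 'hi', 'hello', 'stat', 'hi1', 'hello1']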
We can do
s = df.columns.to_series().groupby(df.columns).cumcount().replace({0:''}).astype(str).radd('.')
df.columns = (df.columns + s).str.strip('.')
df
Out[153]:
Date Type hi hello stat hi.1 hello.1
0 1/1/2022 a 0 0 1 1 0
Trying to split and parse characters from a column and put the parsed data into a different column.
I was trying this by splitting on '_' in the given column, and it worked fine as long as the number of '_' characters in the string was fixed at 2.
Input Data:
Col1
U_a65839_Jan87Apr88
U_b98652_Feb88Apr88_(2).jpg.pdf
V_C56478_mar89Apr89
Q_d15634_Apr90Apr91
Q_d15634_Apr90Apr91_(3).jpeg.pdf
S_e15336_may91Apr93
NaN
Expected Output:
col2
Jan87Apr88
Feb88Apr88
mar89Apr89
Apr90Apr91
Apr90Apr91
may91Apr93
Code I have been trying:
df = pd.read_excel(open(r'Dats.xlsx', 'rb'), sheet_name='Sheet1')
df['Col2'] = df.Col1.str.replace('.*_', '', regex=True)
print(df['Col2'])
I think you want this:
col2 = df.Col1.str.split("_", expand=True)[2]
output:
0 Jan87Apr88
1 Feb88Apr88
2 mar89Apr89
3 Apr90Apr91
4 Apr90Apr91
5 may91Apr93
6 NaN
(you can dropna if you don't want the last row)
Use str.extract here:
df["col2"] = df["Col1"].str.extract(r'((?:[a-z]{3}\d{2}){2})', flags=re.IGNORECASE)
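A small self-contained sketch of this (sample values taken from the question):
import re
import pandas as pd

df = pd.DataFrame({'Col1': ['U_a65839_Jan87Apr88',
                            'U_b98652_Feb88Apr88_(2).jpg.pdf',
                            'V_C56478_mar89Apr89',
                            None]})
df['col2'] = df['Col1'].str.extract(r'((?:[a-z]{3}\d{2}){2})', flags=re.IGNORECASE)
print(df['col2'])
# 0    Jan87Apr88
# 1    Feb88Apr88
# 2    mar89Apr89
# 3           NaN
# Name: col2, dtype: object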
Based on your question, the pandas DataFrame apply can be a good solution:
First, clean the DataFrame by replacing NaNs with empty string ''
df = pd.DataFrame(data=['U_a65839_Jan87Apr88', 'U_b98652_Feb88Apr88_(2).jpg.pdf', 'V_C56478_mar89Apr89', 'Q_d15634_Apr90Apr91', 'Q_d15634_Apr90Apr91_(3).jpeg.pdf', 'S_e15336_may91Apr93', None], columns=['Col1'])
df = df.fillna('')
Col1
0 U_a65839_Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf
2 V_C56478_mar89Apr89
3 Q_d15634_Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf
5 S_e15336_may91Apr93
6
Next, define a function to extract the required string with regex
def fun(s):
    import re
    m = re.search(r'\w{3}\d{2}\w{3}\d{2}', s)
    if m:
        return m.group(0)
    else:
        return ''
Then, apply the function to the DataFrame:
df['Col2'] = df['Col1'].apply(fun)
Col1 Col2
0 U_a65839_Jan87Apr88 Jan87Apr88
1 U_b98652_Feb88Apr88_(2).jpg.pdf Feb88Apr88
2 V_C56478_mar89Apr89 mar89Apr89
3 Q_d15634_Apr90Apr91 Apr90Apr91
4 Q_d15634_Apr90Apr91_(3).jpeg.pdf Apr90Apr91
5 S_e15336_may91Apr93 may91Apr93
6
Hope the above helps.
I am using pandas and I have a column that has numbers but when I check for datatype I get the column is an object. I think one of the rows in that column is actually a string. How can I find out which row is the string? For example:
Name A B
John 0 1
Rich 1 0
Jim O 1
Jim has the letter "O" instead of a zero in column A. What can I use in pandas to find which row has the string instead of the number, if I have thousands of rows? In this example I used the letter O, but it could be any letter, really.
The dtype object means that the column holds generic Python-typed values.
Those values can be any type Python knows—an int, a str, a list of sets of some custom namedtuple type that you created, whatever.
And you can just call normal Python functions or methods on those objects (e.g., by accessing them directly, or via Pandas' apply) the same way you do with any other Python variables.
And that includes the type function, the isinstance function, etc.:
>>> df = pd.DataFrame({'A': [0, 1, 'O'], 'B': [1, 0, 1]})
>>> df.A
0 0
1 1
2 O
Name: A, dtype: object
>>> df.A.apply(type)
0 <class 'int'>
1 <class 'int'>
2 <class 'str'>
Name: A, dtype: object
>>> df.A.apply(lambda x: isinstance(x, str))
0 False
1 False
2 True
Name: A, dtype: bool
>>> df.A.apply(repr)
0 0
1 1
2 'O'
Name: A, dtype: object
… and so on.
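Building on the isinstance check, a short sketch that pulls out the offending rows themselves (sample frame assumed):
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Rich', 'Jim'],
                   'A': [0, 1, 'O'],
                   'B': [1, 0, 1]})
mask = df['A'].apply(lambda x: isinstance(x, str))
print(df[mask])
#   Name  A  B
# 2  Jim  O  1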
You can use pandas.to_numeric to see what doesn't get converted to a number. Then with .isnull() you can subset your original df to see exactly which rows are the problematic ones.
import pandas as pd
df[pd.to_numeric(df.A, errors='coerce').isnull()]
# Name A B
#2 Jim O 1
If you're not sure which column is problematic, you could do something like (assuming you want to check everything other than the 1st name column):
df2 = pd.DataFrame()
for col in df.columns[1::]:
df2[col] = pd.to_numeric(df[col], errors='coerce')
df[df2.isnull().sum(axis=1).astype(bool)]
# Name A B
#2 Jim O 1
I'd like to add another very short and concise solution, which is a combination of ALollz's and abarnert's answers.
First, find all columns of dtype object with cols = np.flatnonzero(df.dtypes == 'object'). Then select those columns with iloc and apply pd.to_numeric (skipping the Name column via the slice cols[1:]). Finally, check for NA values row-wise and return the indices of the rows that contain any.
Full example:
import io
import numpy as np
import pandas as pd
data = '''\
Name A B C
John 0 1 O
Rich 1 0 2
Jim O 1 O'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')

cols = np.flatnonzero(df.dtypes == 'object')
rows = np.flatnonzero(
    df.iloc[:, cols[1:]].apply(pd.to_numeric, errors='coerce').isna().any(axis=1)
)
print(rows)
Returns:
[0 2]  # <-- means that rows 0 and 2 contain NA values in at least one column
This answers your question ("what can I use in pandas to find which row has the string instead of the number"), but for all columns, treating as strings anything that can't be converted to a number with pd.to_numeric.
types = list(df['A'].apply(lambda x: type(x)))
names = list(df['Name'])
d = dict(zip(names, types))
This will give you a dictionary of {name:type} so you know which name has a string value in column A. Alternatively, if you just want to find the row the string is on, use this:
types = list(df['A'].apply(lambda x: type(x)))
rows = df.index.tolist()
d = dict(zip(rows, types))
# to get only the rows that have string values in column A
d = {k:v for k,v in d.items() if v == str}
I have this dataframe:
cnpj Porte
0 453232000125
1 11543123000156
2 345676
3 121234561023456
'cnpj' is currently a float.
If cnpj has '0001' in it, I want to classify 'Porte' as A. So it looks like this:
cnpj Porte
0 453232000125 A
1 11543123000156 A
2 345676
3 121234561023456
I'm trying:
df['Porte'].loc[(df['cnpj'].astype(int).astype(str).str.contains('0001'))]='A'
But it gets me this error:
TypeError: cannot convert the series to <class 'int'>
How could I do that?
This is one approach.
Demo:
import pandas as pd
import numpy as np
df = pd.DataFrame({"cnpj": [453232000125, 11543123000156, 345676]})
df["Porte"] = df["cnpj"].apply(lambda x: "A" if '0001' in str(x) else np.nan)
print(df)
Output:
cnpj Porte
0 453232000125 A
1 11543123000156 A
2 345676 NaN
Another approach:
df = pd.DataFrame({"cnpj": [453232000125, 11543123000156, 345676, 121234561023456]})
df['Porte'] = np.where(df['cnpj'].astype(str).str.contains('0001'), 'A', '')
Output:
cnpj Porte
0 453232000125 A
1 11543123000156 A
2 345676
3 121234561023456
You were very close. Just remove the astype(int) call:
df['Porte'].loc[df['cnpj'].astype(str).str.contains('0001')] = 'A'
The second argument passed to .loc can also be the column you want to update, so another way to achieve this is:
df.loc[df['cnpj'].astype(str).str.contains('0001'), 'Porte'] = "A"
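A quick check of that pattern on data like the question's (integers assumed for the sketch, since the OP's column is float):
import pandas as pd

df = pd.DataFrame({'cnpj': [453232000125, 11543123000156, 345676, 121234561023456]})
df['Porte'] = ''
df.loc[df['cnpj'].astype(str).str.contains('0001'), 'Porte'] = 'A'
print(df)
#               cnpj Porte
# 0     453232000125     A
# 1   11543123000156     A
# 2           345676
# 3  121234561023456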
I want my dataframe to auto-truncate strings which are longer than a certain length.
basically:
pd.set_option('auto_truncate_string_exceeding_this_length', 255)
Any ideas? I have hundreds of columns and don't want to iterate over every data point. If this can be achieved during import that would also be fine (e.g. pd.read_csv())
Thanks.
pd.set_option('display.max_colwidth', 255)
You can use read_csv converters. Let's say you want to truncate the column abc; you can pass a dictionary with a function, like:
def auto_truncate(val):
    return val[:255]

df = pd.read_csv('file.csv', converters={'abc': auto_truncate})
If you have columns with different lengths:
df = pd.read_csv('file.csv', converters={'abc': lambda x: x[:255], 'xyz': lambda x: x[:512]})
Make sure the column type is string. A column index can also be used instead of the name in the converters dict.
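A sketch of the index-keyed form (the CSV content here is only illustrative; position 2 is the column to truncate):
import io
import pandas as pd

def auto_truncate(val):
    return str(val)[:255]

csv_data = 'id,name,notes\n1,foo,' + 'x' * 300 + '\n'
df = pd.read_csv(io.StringIO(csv_data), converters={2: auto_truncate})
print(df['notes'].str.len())
# 0    255
# Name: notes, dtype: int64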
I'm not sure you can do this on the whole df, the following would work after loading:
In [21]:
df = pd.DataFrame({"a":['jasjdhadasd']*5, "b":arange(5)})
df
Out[21]:
a b
0 jasjdhadasd 0
1 jasjdhadasd 1
2 jasjdhadasd 2
3 jasjdhadasd 3
4 jasjdhadasd 4
In [22]:
for col in df:
    if is_string_like(df[col]):
        df[col] = df[col].str.slice(0, 5)
df
Out[22]:
a b
0 jasjd 0
1 jasjd 1
2 jasjd 2
3 jasjd 3
4 jasjd 4
EDIT
I think if you specified the dtypes in the args to read_csv then you could set the max length:
df = pd.read_csv('file.csv', dtype=(np.str, maxlen))
I will try this and confirm shortly
UPDATE
Sadly you cannot specify the length, an error is raised if you try this:
NotImplementedError: the dtype <U5 is not supported for parsing
when attempting to pass the arg dtype=(str,5)
You can also simply truncate a single column with
df['A'] = df['A'].str[:255]
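And if the goal is really to truncate every string column at once (as in the question, with hundreds of columns), a minimal sketch using select_dtypes; the 255 limit is the one from the question:
import pandas as pd

df = pd.DataFrame({'A': ['x' * 300, 'short'], 'B': [1, 2]})
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda s: s.str[:255])
print(df['A'].str.len())
# 0    255
# 1      5
# Name: A, dtype: int64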