I have this dataframe:
cnpj Porte
0 453232000125
1 11543123000156
2 345676
3 121234561023456
'cnpj' is currently stored as float.
If cnpj has '0001' in it, I want to classify 'Porte' as A. So it looks like this:
cnpj Porte
0 453232000125 A
1 11543123000156 A
2 345676
3 121234561023456
I'm trying:
df['Porte'].loc[(df['cnpj'].astype(int).astype(str).str.contains('0001'))]='A'
But it gets me this error:
TypeError: cannot convert the series to <class 'int'>
How could I do that?
This is one approach.
Demo:
import pandas as pd
import numpy as np
df = pd.DataFrame({"cnpj": [453232000125, 11543123000156, 345676]})
df["Porte"] = df["cnpj"].apply(lambda x: "A" if '0001' in str(x) else np.nan)
print(df)
Output:
cnpj Porte
0 453232000125 A
1 11543123000156 A
2 345676 NaN
Another approach:
df = pd.DataFrame({"cnpj": [453232000125, 11543123000156, 345676, 121234561023456]})
df['Porte'] = np.where(df['cnpj'].astype(str).str.contains('0001'), 'A', '')
Output:
cnpj Porte
0 453232000125 A
1 11543123000156 A
2 345676
3 121234561023456
You were very close. Just remove the astype(int) call:
df['Porte'].loc[df['cnpj'].astype(str).str.contains('0001')] = 'A'
The second argument to loc can also be the column you want to update, so another way to achieve your requirement is:
df.loc[df['cnpj'].astype(str).str.contains('0001'), 'Porte'] = "A"
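For completeness, a minimal runnable sketch of that .loc assignment, using the sample cnpj values from the question (assumed here to be plain integers):

import pandas as pd

df = pd.DataFrame({"cnpj": [453232000125, 11543123000156, 345676, 121234561023456]})

# Rows whose cnpj, viewed as text, contains '0001'
mask = df["cnpj"].astype(str).str.contains("0001")
df.loc[mask, "Porte"] = "A"   # creates the Porte column, NaN where the mask is False
print(df)
#                cnpj Porte
# 0      453232000125     A
# 1    11543123000156     A
# 2            345676   NaN
# 3   121234561023456   NaN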
Does anyone know how I'd format this string (which is a column in a dataframe) to be a float so I can sort by the column please?
£880,000
£88,500
£850,000
£845,000
i.e. I want this to become
88,500
845,000
850,000
880,000
Thanks in advance!
Assuming 'col' is the column name.
If you just want to sort, and keep as string, you can use natsorted:
from natsort import natsort_key
df.sort_values(by='col', key=natsort_key)
# OR
from natsort import natsort_keygen
df.sort_values(by='col', key=natsort_keygen())
output:
col
1 £88,500
3 £845,000
2 £850,000
0 £880,000
If you want to convert to floats:
df['col'] = pd.to_numeric(df['col'].str.replace(r'[^\d.]', '', regex=True))
df.sort_values(by='col')
output:
col
1 88500
3 845000
2 850000
0 880000
If you want strings, you can use str.lstrip:
df['col'] = df['col'].str.lstrip('£')
output:
col
0 880,000
1 88,500
2 850,000
3 845,000
I have some columns that have the same names. I would like to add a 1 to the repeated column names.
Data
Date Type hi hello stat hi hello
1/1/2022 a 0 0 1 1 0
Desired
Date Type hi hello stat hi1 hello1
1/1/2022 a 0 0 1 1 0
Doing
mask = df['col2'].duplicated(keep=False)
I believe I can utilize mask, but not sure how to efficiently achieve this without calling out the actual column. I would like to call the full dataset and allow the algorithm to update the dupe.
Any suggestion is appreciated
Use the built-in parser method _maybe_dedup_names():
df.columns = pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(df.columns)
# Date Type hi hello stat hi.1 hello.1
# 0 1/1/2022 a 0 0 1 1 0
This is what pandas uses to deduplicate column headers from read_csv().
Note that it scales to any number of duplicate names:
cols = ['hi'] * 3 + ['hello'] * 5
pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(cols)
# ['hi', 'hi.1', 'hi.2', 'hello', 'hello.1', 'hello.2', 'hello.3', 'hello.4']
In pandas < 1.3:
df.columns = pd.io.parsers.ParserBase({})._maybe_dedup_names(df.columns)
You need to apply the duplicated operation to the column names, map the duplication flags to a string suffix, and then append that suffix to the original column names.
df.columns = df.columns+[{False:'',True:'1'}[x] for x in df.columns.duplicated()]
We can do
s = df.columns.to_series().groupby(df.columns).cumcount().replace({0:''}).astype(str).radd('.')
df.columns = (df.columns + s).str.strip('.')
df
Out[153]:
Date Type hi hello stat hi.1 hello.1
0 1/1/2022 a 0 0 1 1 0
I am using pandas and I have a column that holds numbers, but when I check the datatype I see the column is an object. I think one of the rows in that column is actually a string. How can I find out which row contains the string? For example:
Name A B
John 0 1
Rich 1 0
Jim O 1
Jim has the letter "O" instead of zero in column A. What can I use in pandas to find which row has the string instead of the number if I have thousands of rows? In this example I used the letter O, but it could be any letter, really.
The dtype object means that the column holds generic Python-typed values.
Those values can be any type Python knows—an int, a str, a list of sets of some custom namedtuple type that you created, whatever.
And you can just call normal Python functions or methods on those objects (e.g., by accessing them directly, or via Pandas' apply) the same way you do with any other Python variables.
And that includes the type function, the isinstance function, etc.:
>>> df = pd.DataFrame({'A': [0, 1, 'O'], 'B': [1, 0, 1]})
>>> df.A
0 0
1 1
2 O
Name: A, dtype: object
>>> df.A.apply(type)
0 <class 'int'>
1 <class 'int'>
2 <class 'str'>
Name: A, dtype: object
>>> df.A.apply(lambda x: isinstance(x, str))
0 False
1 False
2 True
Name: A, dtype: bool
>>> df.A.apply(repr)
0 0
1 1
2 'O'
Name: A, dtype: object
… and so on.
You can use pandas.to_numeric to see what doesn't get converted to a number. Then with .isnull() you can subset your original df to see exactly which rows are the problematic ones.
import pandas as pd
df[pd.to_numeric(df.A, errors='coerce').isnull()]
# Name A B
#2 Jim O 1
If you're not sure which column is problematic, you could do something like this (assuming you want to check everything other than the first name column):
df2 = pd.DataFrame()
for col in df.columns[1:]:
    df2[col] = pd.to_numeric(df[col], errors='coerce')
df[df2.isnull().sum(axis=1).astype(bool)]
# Name A B
#2 Jim O 1
I'd like to add another very short and concise solution, which is a combination of ALollz's and abarnert's answers.
First find all columns that are of type object with cols = np.flatnonzero(df.dtypes == 'object'). Filter those columns with iloc and apply pd.to_numeric (skipping the name column via the slice cols[1:]). Then check for NA values row-wise with .isna().any(axis=1) and return the positions of the rows that have any.
Full example:
import io
import numpy as np
import pandas as pd
data = '''\
Name A B C
John 0 1 O
Rich 1 0 2
Jim O 1 O'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
cols = np.flatnonzero(df.dtypes == 'object')
rows = np.flatnonzero(df.iloc[:, cols[1:]].apply(pd.to_numeric, errors='coerce').isna().any(axis=1))
print(rows)
Returns:
[0 2] # <-- means that rows 0 and 2 contain NA values in at least one column
This answers your question (what can I use in pandas to find which row has a string instead of a number), but for all columns at once, treating as strings any values that pd.to_numeric cannot convert.
types = list(df['A'].apply(lambda x: type(x)))
names = list(df['Name'])
d = dict(zip(names, types))
This will give you a dictionary of {name:type} so you know which name has a string value in column A. Alternatively, if you just want to find the row the string is on, use this:
types = list(df['A'].apply(lambda x: type(x)))
rows = df.index.tolist()
d = dict(zip(rows, types))
# to get only the rows that have string values in column A
d = {k:v for k,v in d.items() if v == str}
Given this DataFrame:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'Date':['20/03/17 10:30:34','20/03/17 10:31:24','20/03/17 10:34:34'],
'Value':[4,7,5]})
df['Date'] = pd.to_datetime(df.Date)
df
Out[53]:
Date Value
0 2017-03-20 10:30:34 4
1 2017-03-20 10:31:24 7
2 2017-03-20 10:34:34 5
I am trying to extract the max value and its index. I can get the max value with df.Value.max(), but when I use df.idxmax() to get the index of that value I get a TypeError:
TypeError: float() argument must be a string or a number
Is there any other way to get the Index of the Max value of a Dataframe? (Or any way to correct this one?)
Because it should be:
df.Value.idxmax()
It then returns 1.
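If you also want the full row at that position, a small sketch reusing the question's df (not part of the original answer):

df.loc[df['Value'].idxmax()]
# Date     2017-03-20 10:31:24
# Value                      7
# Name: 1, dtype: object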
If you only care about the Value column, you can use:
df.Value.idxmax()
>>> 1
However, it is strange that it fails on both columns with df.idxmax() as the following works, too:
df.Date.idxmax()
>>> 2
df.idxmax() also works for some other dummy data:
import numpy as np
dummy = pd.DataFrame(np.random.random(size=(5, 2)))
print(dummy)
0 1
0 0.944017 0.365198
1 0.541003 0.447632
2 0.583375 0.081192
3 0.492935 0.570310
4 0.832320 0.542983
print(dummy.idxmax())
0 0
1 3
dtype: int64
You have to specify the column from which you want the index of the maximum value.
To get the idx of the maximum Value, use:
df.Value.idxmax()
If you want the idx of the maximum Date, use:
df.Date.idxmax()
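If you do want a frame-wide idxmax despite the datetime column, one workaround (a sketch, not from the original answers) is to restrict it to the numeric columns first:

df.select_dtypes('number').idxmax()
# Value    1
# dtype: int64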
I want my dataframe to auto-truncate strings which are longer than a certain length.
basically:
pd.set_option('auto_truncate_string_exceeding_this_length', 255)
Any ideas? I have hundreds of columns and don't want to iterate over every data point. If this can be achieved during import that would also be fine (e.g. pd.read_csv())
Thanks.
pd.set_option('display.max_colwidth', 255)
You can use read_csv converters. Let's say you want to truncate the column abc; you can pass a dictionary with a function, like this:
def auto_truncate(val):
    return val[:255]

df = pd.read_csv('file.csv', converters={'abc': auto_truncate})
If you have columns that need different lengths:
df = pd.read_csv('file.csv', converters={'abc': lambda x: x[:255], 'xyz': lambda x: x[:512]})
Make sure the column type is string. A column index can also be used instead of a name in the converters dict.
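For example, a small sketch using positional indices instead of names ('file.csv' and the positions are just placeholders):

# Truncate the first two columns by position rather than by header name
df = pd.read_csv('file.csv', converters={0: lambda x: x[:255], 1: lambda x: x[:255]})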
I'm not sure you can do this on the whole df, but the following would work after loading:
In [21]:
df = pd.DataFrame({"a":['jasjdhadasd']*5, "b":arange(5)})
df
Out[21]:
a b
0 jasjdhadasd 0
1 jasjdhadasd 1
2 jasjdhadasd 2
3 jasjdhadasd 3
4 jasjdhadasd 4
In [22]:
for col in df:
    # truncate only the string (object dtype) columns
    if df[col].dtype == object:
        df[col] = df[col].str.slice(0, 5)
df
Out[22]:
a b
0 jasjd 0
1 jasjd 1
2 jasjd 2
3 jasjd 3
4 jasjd 4
EDIT
I think if you specified the dtypes in the args to read_csv then you could set the max length:
df = pd.read_csv('file.csv', dtype=(np.str, maxlen))
I will try this and confirm shortly
UPDATE
Sadly you cannot specify the length; an error is raised if you try this:
NotImplementedError: the dtype <U5 is not supported for parsing
when attempting to pass the arg dtype=(str,5)
You can also simply truncate a single column with
df['A'] = df['A'].str[:255]
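If you want to apply the same truncation to every string column at once instead of iterating by hand, a small sketch (assuming the string columns have object dtype):

str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda s: s.str[:255])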