pandas column float to string in python

I created a new column called 'order_num', for instance:
import pandas
import numpy as np
import os
df = pandas.read_excel(os.getcwd() + r"/excel.xlsx", sheet_name=0, skiprows=0)
df['order_num']=np.nan
and I wanted to put a value into the newly created column:
df.set_value(index, 'order_num', 'somestr')
and this error message appeared:
ValueError: could not convert string to float: 'somestr'
What is the problem? I guess the default dtype of a newly created column is float, and I want to change it to string.
How can I do that?

The problem is that you create a column of type float, because type(np.nan) returns float.
On a mock DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Column1': [0, 1, 2, 3, 4, 5],
                   'Column2': ['a', 'b', 'c', 'd', 'e', 'f']})
If you create a new column and assign np.nan to it, the new column will be numeric:
df['numeric'] = np.nan
df['numeric'].dtype
Returns:
dtype('float64')
You could instead create a column with empty strings, i.e. '':
df['order_num'] = ''
Column1 Column2 order_num
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
5 5 f
And then add a specific string at a specific index in the 'order_num' column:
index = 0
df = df.set_value(index, 'order_num', 'somestr')
This will give you the expected outcome:
Column1 Column2 order_num
0 0 a somestr
1 1 b
2 2 c
3 3 d
4 4 e
5 5 f
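Note that DataFrame.set_value was deprecated in pandas 0.21 and removed in pandas 1.0; in modern pandas the same single-cell assignment is done with .at. A minimal sketch of the same operation:

```python
import pandas as pd

df = pd.DataFrame({'Column1': [0, 1, 2, 3, 4, 5],
                   'Column2': ['a', 'b', 'c', 'd', 'e', 'f']})
df['order_num'] = ''                 # object-dtype column, so strings are allowed
df.at[0, 'order_num'] = 'somestr'    # .at sets a single cell by row label and column
print(df.head(2))
```

Because the column was created with strings rather than np.nan, no float conversion is attempted on assignment.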


Pandas astype throwing invalid literal for int() with base 10 error

I have a pandas dataframe df whose column name and dtypes are specified in another file (read as data_dict). So to get the data properly I am using the below code:
col_list = data_dict['name'].tolist()
dtype_list = data_dict['type'].tolist()
dtype_dict = {col_list[i]: dtype_list[i] for i in range(len(col_list))}
df.columns = col_list
df = df.fillna(0)
df = df.astype(dtype_dict)
But it is throwing this error:
invalid literal for int() with base 10: '2.230'
Most of the answers I searched online recommended using pd.to_numeric() or something like df[col1].astype(float).astype(int). The issue here is that df contains 50+ columns out of which around 30 should be converted to integer type. Therefore I don't want to convert the data types one column at a time.
So how can I easily fix this error?
Try via boolean masking:
mask = df.apply(lambda x: x.str.isalpha(), axis=1).fillna(False)
Finally:
df[~mask] = df[~mask].astype(float).astype(int)
Or
cols = df[~mask].dropna(axis=1).columns
df[cols] = df[cols].astype(float).astype(int)
Or convert all the listed columns at once (pd.to_numeric works on a Series, not a DataFrame, hence the apply):
df[col_list] = df[col_list].apply(pd.to_numeric)
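The original error comes from casting a string like '2.230' directly to int. One hedged sketch of the two-step cast applied across the whole dtype mapping (the dataframe and dtype strings here are illustrative stand-ins for whatever the data-dictionary file contains):

```python
import pandas as pd

df = pd.DataFrame({'a': ['2.230', '5.000'],
                   'b': ['1.5', '2.5'],
                   'c': ['x', 'y']})
# Stand-in for the dict built from the data-dictionary file.
dtype_dict = {'a': 'int64', 'b': 'float64', 'c': 'object'}

for col, dt in dtype_dict.items():
    if dt in ('int64', 'float64'):
        # int('2.230') raises ValueError, but int(float('2.230')) does not,
        # so route every numeric target through float first.
        df[col] = df[col].astype('float64').astype(dt)

print(df.dtypes.to_dict())
```

This avoids converting the columns one at a time while leaving non-numeric columns untouched.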
You can set the data type of the whole dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': list(map(str, np.random.rand(10))), 'B': np.random.rand(10)})
df.apply(pd.to_numeric)
A B
0 0.493771 0.389934
1 0.991265 0.387819
2 0.398947 0.128031
3 0.869156 0.007609
4 0.129748 0.532235
5 0.993632 0.882933
6 0.244311 0.213737
7 0.773192 0.229257
8 0.392530 0.339418
9 0.732609 0.685258
and for just some columns like this:
df[['A', 'B']] = df[['A', 'B']].apply(pd.to_numeric)
If you want to convert columns to numeric types across a whole dataframe where you do not know which columns hold numbers, you can use this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': list(map(str, np.random.rand(10))),
                   'B': np.random.rand(10),
                   'C': list('ABCDEFGHIJ')})
def to_num(df):
    for col in df:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            continue
    return df
df.pipe(to_num)
A B C
0 0.762027 0.095877 A
1 0.647066 0.931435 B
2 0.016939 0.806675 C
3 0.260255 0.346676 D
4 0.561694 0.551960 E
5 0.561363 0.675580 F
6 0.312432 0.498806 G
7 0.353007 0.203697 H
8 0.418549 0.128924 I
9 0.728632 0.600307 J

counting number of customers per week during 6 years [duplicate]

I am using .size() on a groupby result in order to count how many items are in each group.
I would like the result to be saved to a new column name without manually editing the column names array, how can it be done?
This is what I have tried:
grpd = df.groupby(['A','B'])
grpd['size'] = grpd.size()
grpd
and the error I got:
TypeError: 'DataFrameGroupBy' object does not support item assignment
(on the second line)
The .size() built-in method of DataFrameGroupBy objects actually returns a Series object with the group sizes and not a DataFrame. If you want a DataFrame whose column is the group sizes, indexed by the groups, with a custom name, you can use the .to_frame() method and use the desired column name as its argument.
grpd = df.groupby(['A','B']).size().to_frame('size')
If you wanted the groups to be columns again you could add a .reset_index() at the end.
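A minimal runnable sketch of the to_frame approach, including the reset_index mentioned above:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['a', 'a', 'b']})
# size() returns a Series of group sizes; to_frame names the column.
grpd = df.groupby(['A', 'B']).size().to_frame('size').reset_index()
print(grpd)
```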
You need GroupBy.transform with 'size'; the length of df stays the same as before:
Note:
Here it is necessary to select one column after the groupby, otherwise you get an error. Because GroupBy.size counts NaNs too, it does not matter which column is used; all columns work the same.
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x','y','y']
, 'B': ['a', 'c', 'c','b','b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df['size'] = df.groupby(['A', 'B'])['A'].transform('size')
print (df)
A B size
0 x a 1
1 x c 2
2 x c 2
3 y b 2
4 y b 2
If you need to set a column name while aggregating (the length of df is obviously NOT the same as before):
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x','y','y']
, 'B': ['a', 'c', 'c','b','b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df = df.groupby(['A', 'B']).size().reset_index(name='Size')
print (df)
A B Size
0 x a 1
1 x c 2
2 y b 2
The result of df.groupby(...) is not a DataFrame. To get a DataFrame back, you have to apply a function to each group, transform each element of a group, or filter the groups.
It seems like you want a DataFrame that contains (1) all your original data in df and (2) the count of how much data is in each group. These things have different lengths, so if they need to go into the same DataFrame, you'll need to list the size redundantly, i.e., for each row in each group.
df['size'] = df.groupby(['A', 'B'])['A'].transform('size')
(Aside: It's helpful if you can show succinct sample input and expected results.)
You can set the as_index parameter in groupby to False to get a DataFrame instead of a Series:
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'], 'B': [1, 2, 2, 2]})
df.groupby(['A', 'B'], as_index=False).size()
Output:
A B size
0 a 1 1
1 a 2 1
2 b 2 2
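A related option (named aggregation, available since pandas 0.25), sketched here, lets you pick the count column's name directly in the aggregation:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'], 'B': [1, 2, 2, 2]})
# agg(size=('B', 'size')) names the output column 'size' explicitly.
out = df.groupby(['A', 'B'], as_index=False).agg(size=('B', 'size'))
print(out)
```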
Let's say n is the name of the dataframe and cst is the column whose repeated items are being counted.
The code below gives the count in a new column:
from collections import Counter
import pandas as pd
cstn = Counter(n.cst)
cstlist = pd.DataFrame.from_dict(cstn, orient='index').reset_index()
cstlist.columns = ['name', 'cnt']
n['cnt'] = n['cst'].map(cstlist.set_index('name')['cnt'])
Hope this will work.
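The Counter round-trip above can be sketched more directly with value_counts and Series.map (the dataframe and column names here are hypothetical, matching the answer's n and cst):

```python
import pandas as pd

n = pd.DataFrame({'cst': ['apple', 'pear', 'apple', 'apple']})
# value_counts gives a Series indexed by value; map looks each row up in it.
n['cnt'] = n['cst'].map(n['cst'].value_counts())
print(n)
```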

Converting strings in a dataframe column to an index value

Using Python and Pandas, I have a
dataframe, df
with a column entitled 'letters'
a list, letlist = ['a','b','c','d']
I would like to create a new column, 'letters_index', where the generated value would be the index, within the list letlist, of the string in the column letters.
I tried
df['letters_index'] = letlist.index(df['letters'])
However, this didn't work. Do you have any suggestions?
As far as I understand, you need:
letlist = ['a', 'b', 'c', 'd']
print(df)
Output:
letters
0 a
1 b
2 b
3 d
4 c
And then
df['new_col'] = df['letters'].apply(lambda x: letlist.index(x))
Output:
0 0
1 1
2 1
3 3
4 2
Name: letters, dtype: int64
Beware that if the value in the column is not present in the list it would throw a ValueError.
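A hedged alternative that tolerates values missing from the list: build a value-to-index dict and use Series.map, which yields NaN for unmatched values instead of raising:

```python
import pandas as pd

letlist = ['a', 'b', 'c', 'd']
df = pd.DataFrame({'letters': ['a', 'b', 'b', 'z']})  # 'z' is not in letlist

pos = {letter: i for i, letter in enumerate(letlist)}
df['letters_index'] = df['letters'].map(pos)   # unmatched values become NaN
print(df)
```

Note the resulting column becomes float64 when any NaN is present.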

Pandas: how to find and concatenate values

I'm trying to replace and add some values in a pandas dataframe object. I have the following code:
import pandas as pd
df = pd.DataFrame.from_items([('A', ["va-lue", "value-%", "value"]), ('B', [4, 5, 6])])
print df
df['A'] = df['A'].str.replace('%', '_0')
print df
df['A'] = df['A'].str.replace('-', '')
print df
# almost there?
df.A[df['A'].str.contains('-')] + "_0"
How can I find the cell values in column A which contains '-' sign, replace this value with '' and add for these values a trailing '_0'? The resulting data set should look like this
A B
0 value_0 4
1 value_0 5
2 value 6
You can first keep track of the rows whose A needs to be appended with the trailing string, and perform these operations in two steps:
mask = df['A'].str.contains('-')
df['A'] = df['A'].str.replace('-|%', '', regex=True)
df.loc[mask, 'A'] += '_0'
print(df)
Output:
A B
0 value_0 4
1 value_0 5
2 value 6
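An alternative sketch that does the cleanup and suffixing in a single assignment, using numpy.where to append '_0' only where a '-' was present:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['va-lue', 'value-%', 'value'], 'B': [4, 5, 6]})

cleaned = df['A'].str.replace('-|%', '', regex=True)
# np.where picks the suffixed string where the original contained '-'.
df['A'] = np.where(df['A'].str.contains('-'), cleaned + '_0', cleaned)
print(df)
```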

Filtering rows from pandas dataframe using concatenated strings

I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:
1) I tried adding a column of booleans to the data frame, being true if that row corresponds to one of the identifiers, and false otherwise (hoping to be able to do filtering afterwards using the new column):
df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids
where
acids
is the series containing the identifiers.
However, this gives me a
TypeError: unhashable type
2) I tried filtering using the apply function:
df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]
This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.
3) I have added a new column, containing the concatenated strings/identifiers, and then try to filter afterwards (see Filter dataframe rows if value in column is in a set list of values):
df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]
But again, the dataframe doesn't change.
I hope this makes sense...
Any suggestions where I might be going wrong?
Thanks,
Anne
I think you're asking for something like the following:
In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])
In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})
In [3]: df
Out[3]:
ids vals
0 a 1
1 b 2
2 c 3
3 f 4
In [4]: other_ids
Out[4]:
0 a
1 b
2 c
3 c
dtype: object
In this case, the series other_ids would be like your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the .isin() method.
In [5]: df.ids.isin(other_ids)
Out[5]:
0 True
1 True
2 True
3 False
Name: ids, dtype: bool
This gives a column of bools that we can index into:
In [6]: df[df.ids.isin(other_ids)]
Out[6]:
ids vals
0 a 1
1 b 2
2 c 3
This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
Reading a bit more, you may be having trouble because you have two columns in df that hold your ids? A single Series.isin call only checks one column, but we can get around that with something like:
In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'],
'ids2': ['e', 'f', 'c', 'f']})
In [27]: df
Out[27]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4
In [28]: df.ids.isin(other_ids) + df.ids2.isin(other_ids)
Out[28]:
0 True
1 True
2 True
3 False
dtype: bool
True is like 1 and False is like 0, so adding the two boolean Series from the two isin() calls gives something like an OR operation (the | operator does the same thing more explicitly). Then, as before, we can index into this boolean Series:
In [29]: new = df.loc[df.ids.isin(other_ids) + df.ids2.isin(other_ids)]
In [30]: new
Out[30]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
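Putting this together for the asker's actual setup, where the identifier is the concatenation of the first two columns (AcNo and Sortcode come from the question; the data values here are hypothetical string columns):

```python
import pandas as pd

df = pd.DataFrame({'AcNo': ['A1', 'A2', 'A3'],
                   'Sortcode': ['X', 'Y', 'Z'],
                   'val': [10, 20, 30]})
acids = pd.Series(['A1X', 'A3Z'])  # hypothetical identifier series

# Concatenate the two id columns, then keep rows whose combined id is in acids.
filtered = df[(df['AcNo'] + df['Sortcode']).isin(acids)]
print(filtered)
```

Note that isin returns a new boolean mask; the original df is unchanged, which is why the asker's attempt 3 appeared to do nothing — the filtered result must be assigned or used directly.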
