pandas rounding when converting the series to int - python

How can I round values to a number of decimals that varies per row, based on an assigned series?
My sample data is like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(1, 5, size=(4, 1)), columns=['Results'])
df['groups'] = ['A', 'B', 'C', 'D']
df['decimal'] = [1, 0, 2, 3]
This produces a dataframe like:
Results groups decimal
0 2.851325 A 1
1 1.397018 B 0
2 3.522660 C 2
3 1.995171 D 3
Next: each result needs to be rounded to the number of decimals shown in decimal. What I tried below raised TypeError: cannot convert the series to <class 'int'>:
df['new'] = df['Results'].round(df['decimal'])
I want the results like:
Results groups decimal new
0 2.851325 A 1 2.9
1 1.397018 B 0 1
2 3.522660 C 2 3.52
3 1.995171 D 3 1.995

You can pass a dict-like object to DataFrame.round to set a different precision for each column. Since here the precision varies per row rather than per column, transpose a single-column DataFrame (constructed from the Results column), round it, and transpose back:
df['Results'] = df[['Results']].T.round(df['decimal']).T
Another option is a list comprehension:
df['Results'] = [round(num, rnd) for num, rnd in zip(df['Results'], df['decimal'])]
Output:
Results groups decimal
0 2.500 A 1
1 2.000 B 0
2 2.190 C 2
3 1.243 D 3
Note that since it's a single column, its displayed decimal places are determined by the row with the most decimals; but if you look at the underlying values of this DataFrame, you'll see that the precisions have indeed changed:
>>> df[['Results']].to_dict('list')
{'Results': [2.5, 2.0, 2.19, 1.243]}
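If the frame is large, a vectorized alternative is possible. This is a sketch, assuming the decimal column holds non-negative integers: scale each value by a power of ten, round, and scale back.
import numpy as np
factor = 10.0 ** df['decimal']                    # per-row scaling factor
df['new'] = np.round(df['Results'] * factor) / factor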

Try this:
df['new'] = df['Results'].copy()
df = df.round({'new': 1})

Related

How to create a new Dataframe column from a list. Pd series not working

I need to perform a calculation on a dataframe, iterating over rows. For each row, the output is appended to a list, and then the list is used to create a DataFrame column:
list123 = []
# ... for loop over the rows of df; rate is the output value for each row ...
list123.append(rate)
df['new column'] = list123
Doing it like this, I get the error:
ValueError: Length of values does not match length of index
So I tried converting the list to a Series:
df['new column'] = pd.Series(list123)
However, if I convert the list to a Series, not all of the values are picked up...for some rows, the new column is just empty. This shouldn't be the case, because I ran the same calculation on single rows and all of them produced values.
I would really appreciate your help in understanding what I'm missing or doing wrong.
Thanks!
Suppose the following dataframe:
>>> df
A B C
0 1 2 2
1 1 8 9
2 3 9 1
3 6 2 4
4 9 1 4
>>> df.shape
(5, 3)
You can iterate over it in several ways:
# Over columns
>>> for i in df: print(f"{i}")
A
B
C
# Over rows
>>> for idx, sr in df.iterrows(): print(f"{idx}:\n{sr}\n")
0:
A 1
B 2
C 2
Name: 0, dtype: int64
1:
A 1
B 8
C 9
Name: 1, dtype: int64
...
# Over rows
>>> for row in df.itertuples(): print(f"{row}")
Pandas(Index=0, A=1, B=2, C=2)
Pandas(Index=1, A=1, B=8, C=9)
Pandas(Index=2, A=3, B=9, C=1)
Pandas(Index=3, A=6, B=2, C=4)
Pandas(Index=4, A=9, B=1, C=4)
You can convert the output of your loop into a new column only with the last two methods (the ones that iterate over rows).
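As a concrete sketch of the itertuples approach (the rate formula below is just a placeholder for whatever your loop computes):
rates = []
for row in df.itertuples():
    rate = row.B / row.C   # placeholder calculation; substitute your own
    rates.append(rate)
df['new column'] = rates   # one entry per row, so the lengths match
Also note: if you do wrap the list in a Series, pass the frame's index explicitly, e.g. pd.Series(list123, index=df.index). A bare pd.Series(list123) gets a fresh 0..n-1 index, and column assignment aligns on index labels, which is why rows whose labels don't match come out empty.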

Pandas: count number of duplicate rows using groupby

I have a dataframe with duplicate rows
>>> d = pd.DataFrame({'n': ['a', 'a', 'a'], 'v': [1,2,1]})
>>> d
n v
0 a 1
1 a 2
2 a 1
I would like to understand how to use .groupby() method specifically so that I can add a new column to the dataframe which shows count of rows which are identical to the current one.
>>> dd = d.groupby(by=['n','v'], as_index=False) # Use all columns to find groups of identical rows
>>> for k,v in dd:
... print(k, "\n", v, "\n") # Check what we found
...
('a', 1)
n v
0 a 1
2 a 1
('a', 2)
n v
1 a 2
When I try dd.count() on the resulting DataFrameGroupBy object, I get IndexError: list index out of range. This seems to happen because all columns are used in the grouping operation and there's no other column left to count. Similarly, dd.agg({'n', 'count'}) fails with ValueError: no results.
I could use .apply() to achieve something that looks like the result.
>>> dd.apply(lambda x: x.assign(freq=len(x)))
n v freq
0 0 a 1 2
2 a 1 2
1 1 a 2 1
However, this has two issues: 1) something happens to the index, so it is hard to map the result back to the original index; 2) it does not seem like idiomatic pandas, and the manuals discourage using .apply() as it can be slow.
Is there a more idiomatic way to count duplicate rows when using .groupby()?
One solution is to use GroupBy.size for an aggregated output with the counts:
d = d.groupby(by=['n','v']).size().reset_index(name='c')
print (d)
n v c
0 a 1 2
1 a 2 1
Your solution works if you specify a column name after the groupby, which is necessary because there are no columns other than n and v in the input DataFrame:
d = d.groupby(by=['n','v'])['n'].count().reset_index(name='c')
print (d)
n v c
0 a 1 2
1 a 2 1
If you instead need a new column aligned with the original rows, use GroupBy.transform - the new column is filled with the aggregated values:
d['c'] = d.groupby(by=['n','v'])['n'].transform('size')
print (d)
n v c
0 a 1 2
1 a 2 1
2 a 1 2
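For completeness, on pandas 1.1+ you can get the same aggregated counts from DataFrame.value_counts, which counts unique rows directly; a small sketch (note the result is sorted by count, not by group key):
out = d.value_counts().reset_index(name='c')
print(out)
   n  v  c
0  a  1  2
1  a  2  1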

How to parse out array from column inside a dataframe?

I have a data frame that looks like this:
Index Values Digits
1 [1.0,0.13,0.52...] 3
2 [1.0,0.13,0.32...] 3
3 [1.0,0.31,0.12...] 1
4 [1.0,0.30,0.20...] 2
5 [1.0,0.30,0.20...] 3
My output should be:
Index Values Digits
1 [0.33,0.04,0.17...] 3
2 [0.33,0.04,0.11...] 3
3 [0.33,0.10,0.40...] 1
4 [0.33,0.10,0.07...] 2
5 [0.33,0.10,0.07...] 3
I believe the Values column has a np.array within each cell? Is this technically an array?
I wish to parse out the Values column and divide all values within the array by 3.
My attempts have stopped at parsing out the values:
a = pd.DataFrame(df['Values'].values.tolist())
IIUC, apply the list calculation:
df.Values.apply(lambda x: [y / 3 for y in x])
Out[1095]:
0 [0.3333333333333333, 0.043333333333333335, 0.1...
1 [0.3333333333333333, 0.043333333333333335, 0.1...
Name: Values, dtype: object
# df.Values = df.Values.apply(lambda x: [y / 3 for y in x])
Created dataframe:
import pandas as pd
d = {'col1': [[1,10], [2,20]], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
created function:
def divide_by_3(lst):
    output = []
    for i in lst:
        output.append(i / 3.0)
    return output
apply function:
df.col1.apply(divide_by_3)
result:
0 [0.333333333333, 3.33333333333]
1 [0.666666666667, 6.66666666667]
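If every cell holds a list of the same length, a vectorized sketch stacks the column into one 2-D numpy array and divides once:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Values': [[1.0, 0.13, 0.52], [1.0, 0.13, 0.32]]})
arr = np.vstack(df['Values'].to_numpy()) / 3   # shape: (n_rows, list_length)
df['Values'] = list(arr)                       # back to one array per cell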

Is there a way to do a Series.map in place, but keep original value if no match?

The scenario here is that I've got a dataframe df with raw integer data, and a dict map_array which maps those ints to string values.
I need to replace the values in the dataframe with the corresponding values from the map, but keep the original value if it doesn't map to anything.
So far, the only way I've been able to figure out how to do what I want is by using a temporary column. However, with the size of data that I'm working with, this could sometimes get a little bit hairy. And so, I was wondering if there was some trick to do this in pandas without needing the temp column...
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1,5, size=(100,1)))
map_array = {1:'one', 2:'two', 4:'four'}
df['__temp__'] = df[0].map(map_array, na_action=None)
#I've tried varying the na_action arg to no effect
nan_index = df['__temp__'][df['__temp__'].isnull()].index
df.loc[nan_index, '__temp__'] = df.loc[nan_index, 0]
df[0] = df['__temp__']
df = df.drop(['__temp__'], axis=1)
I think you can simply use .replace, whether on a DataFrame or a Series:
>>> df = pd.DataFrame(np.random.randint(1,5, size=(3,3)))
>>> df
0 1 2
0 3 4 3
1 2 1 2
2 4 2 3
>>> map_array = {1:'one', 2:'two', 4:'four'}
>>> df.replace(map_array)
0 1 2
0 3 four 3
1 two one two
2 four two 3
>>> df.replace(map_array, inplace=True)
>>> df
0 1 2
0 3 four 3
1 two one two
2 four two 3
I'm not sure what the memory hit of changing column dtypes will be, though.
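Another common pattern for "map, but keep unmatched values" is chaining map with fillna; a sketch on the same data:
df[0] = df[0].map(map_array).fillna(df[0])
Like replace, this leaves the column as object dtype once strings are mixed in with the remaining integers.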

Apply function to pandas DataFrame that can return multiple rows

I am trying to transform DataFrame, such that some of the rows will be replicated a given number of times. For example:
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
class count
0 A 1
1 B 0
2 C 2
should be transformed to:
class
0 A
1 C
2 C
This is the reverse of aggregation with count function. Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think this is currently possible in pandas.
You could use groupby:
def f(group):
    row = group.iloc[0]
    return pd.DataFrame({'class': [row['class']] * row['count']})
df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
class
0 A
0 C
1 C
You can fix the index of the result however you like.
I know this is an old question, but I was having trouble getting Wes's answer to work for multiple columns in the dataframe, so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.
You basically just specify which column holds the counts, and you get an expanded dataframe in return.
import pandas as pd
df = pd.DataFrame({'class 1': ['A', 'B', 'C', 'A'],
                   'class 2': [ 1,   2,   3,   1],
                   'count':   [ 3,   3,   3,   1]})
print(df, "\n")
def f(group, *args):
    row = group.iloc[0]
    Dict = {}
    row_dict = row.to_dict()
    for item in row_dict:
        Dict[item] = [row[item]] * row[args[0]]
    return pd.DataFrame(Dict)
def ExpandRows(df, WeightsColumnName):
    df_expand = df.groupby(df.columns.tolist(), group_keys=False).apply(f, WeightsColumnName).reset_index(drop=True)
    return df_expand
df_expanded = ExpandRows(df, 'count')
print(df_expanded)
Returns:
class 1 class 2 count
0 A 1 3
1 B 2 3
2 C 3 3
3 A 1 1
class 1 class 2 count
0 A 1 1
1 A 1 3
2 A 1 3
3 A 1 3
4 B 2 3
5 B 2 3
6 B 2 3
7 C 3 3
8 C 3 3
9 C 3 3
With regards to speed, my base df is 10 columns by ~6k rows, and expanded it's ~100,000 rows; the expansion takes ~7 seconds. I'm not sure in this case whether grouping is necessary or wise, since it uses all the columns to form the groups, but hey, it's only 7 seconds.
There is an even simpler and significantly more efficient solution.
I had to make a similar modification to a table of about 3.5M rows, and the previously suggested solutions were extremely slow.
A better way is to use numpy's repeat to generate a new index in which each row's position is repeated according to its given count, and then use iloc to select rows of the original table by that index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
class
0 A
1 C
2 C
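The same repeat logic can also be written with pandas' own Index.repeat; a one-line sketch, assuming the default RangeIndex:
out = df.loc[df.index.repeat(df['count'])].drop(columns='count').reset_index(drop=True)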
This question is very old and the answers do not reflect pandas' modern capabilities. You can use iterrows to loop over every row, use the DataFrame constructor to build a new DataFrame with the correct number of copies of each row, and finally use pd.concat to concatenate them together.
pd.concat([pd.DataFrame(row.to_dict(), index=range(row['count']))
           for _, row in df.iterrows()], ignore_index=True)
class count
0 A 1
1 C 2
2 C 2
This has the benefit of working with any size DataFrame.
