How to parse out an array from a column inside a dataframe? - python

I have a data frame that looks like this:
Index  Values               Digits
1      [1.0,0.13,0.52...]   3
2      [1.0,0.13,0.32...]   3
3      [1.0,0.31,0.12...]   1
4      [1.0,0.30,0.20...]   2
5      [1.0,0.30,0.20...]   3
My output should be:
Index  Values                Digits
1      [0.33,0.04,0.17...]   3
2      [0.33,0.04,0.11...]   3
3      [0.33,0.10,0.40...]   1
4      [0.33,0.10,0.07...]   2
5      [0.33,0.10,0.07...]   3
I believe the Values column holds an np.array in each cell? Is this technically an array?
I want to parse out the Values column and divide all values within the array by 3.
My attempts have stopped at parsing out the values:
a = df(df['Values'].values.tolist())

IIUC, apply the list calculation:
df.Values.apply(lambda x : [y/3 for y in x])
Out[1095]:
0 [0.3333333333333333, 0.043333333333333335, 0.1...
1 [0.3333333333333333, 0.043333333333333335, 0.1...
Name: Values, dtype: object
#df.Values=df.Values.apply(lambda x : [y/3 for y in x])
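If the cells really hold NumPy arrays (rather than plain lists), the division is already vectorized per cell, so a sketch without the list comprehension could look like this (assuming each cell is array-like):
import numpy as np
# Convert each cell to an ndarray and divide elementwise;
# this works whether the cells started out as lists or np.arrays.
df['Values'] = df['Values'].apply(lambda x: np.asarray(x) / 3)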

Created dataframe:
import pandas as pd
d = {'col1': [[1,10], [2,20]], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
created function:
def divide_by_3(lst):
    output = []
    for i in lst:
        output.append(i / 3.0)
    return output
apply function:
df.col1.apply(divide_by_3)
result:
0 [0.333333333333, 3.33333333333]
1 [0.666666666667, 6.66666666667]

Related

Count values of each row in pandas dataframe only for consecutive numbers

I have a pandas dataframe with an "id" column and a "frame num" column.
I want to count how many rows there are for each id and print the result. The catch is that I want to count ONLY runs of consecutive numbers in "frame num".
For example: if frame num is [1,2,3,45,47,122,123,124,125] and id is [1,1,1,1,1,1,1,1,1], it should print: 3 1 1 4 (and do that for EACH id).
Is there any way to do that? I went crazy trying to figure it out! Counting rows for each id would be a simple GROUP BY, but this extra condition makes it difficult.
You can use pandas.DataFrame.shift() to flag consecutive numbers, then itertools.groupby to build the list of run counts.
import pandas as pd
from itertools import chain
from itertools import groupby
# Example input dataframe
df = pd.DataFrame({
'num' : [1,2,3,45,47,122,123,124,125,1,2,3,45,47,122,123,124,125],
'id' : [1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2]
})
df['s'] = (df['num']-1 == df['num'].shift()) | (df['num']+1 == df['num'].shift(-1))
res = df.groupby('id')['s'].apply(
    lambda g: list(chain.from_iterable(
        [len(list(group))] if key else [1] * len(list(group))
        for key, group in groupby(g)
    ))
)
print(res)
Output:
id
1 [3, 1, 1, 4]
2 [3, 1, 1, 4]
Name: s, dtype: object
Update: Get the output as a dataframe:
>>> res.to_frame().explode('s').reset_index()
id s
0 1 3
1 1 1
2 1 1
3 1 4
4 2 3
5 2 1
6 2 1
7 2 4
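An alternative sketch, not from the original answer, using the diff()/cumsum() run-labeling idiom: within each id, a new run starts wherever the gap to the previous number is not exactly 1, and the sizes of the resulting groups are the counts.
import pandas as pd
df = pd.DataFrame({
    'num': [1, 2, 3, 45, 47, 122, 123, 124, 125],
    'id':  [1, 1, 1, 1, 1, 1, 1, 1, 1],
})
# (s.diff() != 1) marks the start of each consecutive run;
# cumsum() turns those marks into a distinct label per run.
run_id = df.groupby('id')['num'].transform(lambda s: (s.diff() != 1).cumsum())
counts = df.groupby(['id', run_id]).size()
print(counts.tolist())  # [3, 1, 1, 4]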

How to create a new DataFrame column from a list. pd.Series not working

I need to perform a calculation on a dataframe, iterating over rows. For each row, the output is appended to a list, and then the list is used to create a new DataFrame column:
list123 = []
...for loop on df... rate is the output value for each row
list123.append(rate)
df['new column'] = list123
Doing it like this I get the error:
ValueError: Length of values does not match length of index
so I tried to convert the list to a Series:
df['new column'] = pd.Series(list123)
However, if I convert the list to a Series, not all of the values are picked up: for some rows, the new column is just empty. This shouldn't be the case, because I tried performing the same calculation on single rows and all of them produce values.
I would really appreciate your help in understanding what I'm missing or doing wrong.
thanks!
Suppose the following dataframe:
>>> df
A B C
0 1 2 2
1 1 8 9
2 3 9 1
3 6 2 4
4 9 1 4
>>> df.shape
(5, 3)
You can iterate over it in several ways:
# Over columns
>>> for i in df: print(f"{i}")
A
B
C
# Over rows
>>> for idx, sr in df.iterrows(): print(f"{idx}:\n{sr}\n")
0:
A 1
B 2
C 2
Name: 0, dtype: int64
1:
A 1
B 8
C 9
Name: 1, dtype: int64
...
# Over rows
>>> for row in df.itertuples(): print(f"{row}")
Pandas(Index=0, A=1, B=2, C=2)
Pandas(Index=1, A=1, B=8, C=9)
Pandas(Index=2, A=3, B=9, C=1)
Pandas(Index=3, A=6, B=2, C=4)
Pandas(Index=4, A=9, B=1, C=4)
You can convert the output of your loop into a new column only with the last two methods.
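As for the empty values when assigning pd.Series(list123): a likely cause (a sketch, assuming your frame's index is not the default 0..n-1) is index alignment. pd.Series(list123) carries a fresh 0..n-1 index, and column assignment aligns on index labels, leaving NaN wherever labels don't match; passing the frame's own index avoids that.
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 3, 6, 9]}, index=[10, 11, 12, 13, 14])
rates = []
for _, row in df.iterrows():
    rates.append(row['A'] * 2)  # stand-in for the real per-row calculation
# Align the collected values with the frame's own labels so that
# assignment does not leave NaN for non-matching index entries.
df['new column'] = pd.Series(rates, index=df.index)
print(df)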

Append results of DataFrame apply lambda to DataFrame or new Series

I am using the apply method with a lambda to compute on each row of a DataFrame, returning a Series.
statsSeries = matchData.apply(lambda row: mytest(row), axis=1)
where mytest(row) is a function that returns timestamp, float, float.
def mytest(row):
    timestamp = row['timestamp']
    wicketsPerOver = row['wickets'] / row['overs']
    runsPerWicket = row['runs'] / row['wickets']
    return timestamp, wicketsPerOver, runsPerWicket
As I have written it, statsSeries contains an index and, as its values, tuples of (timestamp, wicketsPerOver, runsPerWicket).
How can I return a Series with three columns [timestamp, wicketsPerOver, runsPerWicket]?
It appears you need to chain .apply(pd.Series) onto the result.
Here is a minimal example:
import pandas as pd
df = pd.DataFrame({0: [1, 2, 3, 4]})
def add_some(row):
    return row[0] + 1, row[0] + 2, row[0] + 3
df[[1, 2, 3]] = df.apply(add_some, axis=1).apply(pd.Series)
print(df)
0 1 2 3
0 1 2 3 4
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
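A shorter variant, assuming pandas >= 0.23 (not used in the original answer): apply with result_type='expand' unpacks the returned tuples into columns directly, skipping the second apply.
# Same result without the extra .apply(pd.Series) pass:
df[[1, 2, 3]] = df.apply(add_some, axis=1, result_type='expand')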

pandas filter large dataframe and order by a list

I have a large dataframe as follows:
master_df
          result  item
0              5  id13
1              6  id23432
2              3  id2832
3              4  id9823
......
84376253       7  id9632
And another smaller dataframe as follows:
df = pd.DataFrame({'item' : ['id9632', 'id13', 'id2832', 'id2342']})
How can I extract the relevant elements from master_df.result to match with df.item so I can achieve the following:
df = df.assign(result=list_of_results_in_order)
You can also do a merge:
df = df.merge(master_df, on='item', how='left')
I think you need isin with boolean indexing:
#for Series
s = master_df.loc[master_df['item'].isin(df['item']),'result']
print (s)
0 5
2 3
84376253 7
Name: result, dtype: int64
#for list
L = master_df.loc[master_df['item'].isin(df['item']),'result'].tolist()
print (L)
[5, 3, 7]
#for DataFrame
df1 = master_df[master_df['item'].isin(df['item'])]
print (df1)
result item
0 5 id13
2 3 id2832
84376253 7 id9632
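Note that the isin approaches return results in master_df order, not in the order of df['item']. A sketch of pulling them in df['item'] order (assuming 'item' values are unique in master_df; items missing from master_df come back as NaN):
# Index master_df by item, then look results up in df['item'] order.
s = master_df.set_index('item')['result'].reindex(df['item'])
print(s.tolist())  # [7.0, 5.0, 3.0, nan]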

Apply function to pandas DataFrame that can return multiple rows

I am trying to transform DataFrame, such that some of the rows will be replicated a given number of times. For example:
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
class count
0 A 1
1 B 0
2 C 2
should be transformed to:
class
0 A
1 C
2 C
This is the reverse of aggregation with the count function. Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think this is possible in pandas right now.
You could use groupby:
def f(group):
    row = group.iloc[0]  # irow() was removed from pandas; iloc[0] is the equivalent
    return pd.DataFrame({'class': [row['class']] * row['count']})
df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
class
0 A
0 C
1 C
You can fix the index of the result however you like; one option is sketched below.
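For instance, a minimal sketch that renumbers the expanded rows 0..n-1:
# Chain reset_index to get a clean RangeIndex on the result.
df.groupby('class', group_keys=False).apply(f).reset_index(drop=True)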
I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe, so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.
You basically just specify which column holds the counts, and you get an expanded dataframe in return.
import pandas as pd
df = pd.DataFrame({'class 1': ['A', 'B', 'C', 'A'],
                   'class 2': [1, 2, 3, 1],
                   'count':   [3, 3, 3, 1]})
print(df, "\n")
def f(group, *args):
    row = group.iloc[0]  # iloc[0] replaces the long-removed irow(0)
    Dict = {}
    row_dict = row.to_dict()
    for item in row_dict:
        Dict[item] = [row[item]] * row[args[0]]
    return pd.DataFrame(Dict)
def ExpandRows(df, WeightsColumnName):
    df_expand = df.groupby(df.columns.tolist(), group_keys=False).apply(f, WeightsColumnName).reset_index(drop=True)
    return df_expand
df_expanded = ExpandRows(df, 'count')
print(df_expanded)
Returns:
class 1 class 2 count
0 A 1 3
1 B 2 3
2 C 3 3
3 A 1 1
class 1 class 2 count
0 A 1 1
1 A 1 3
2 A 1 3
3 A 1 3
4 B 2 3
5 B 2 3
6 B 2 3
7 C 3 3
8 C 3 3
9 C 3 3
With regards to speed, my base df is 10 columns by ~6k rows, and the expanded version is ~100,000 rows; the expansion takes ~7 seconds. I'm not sure grouping is necessary or wise here, since it groups on all the columns, but at only 7 seconds it hardly matters.
There is an even simpler and significantly more efficient solution.
I had to make a similar modification for a table of about 3.5M rows, and the previously suggested solutions were extremely slow.
A better way is to use numpy's repeat to generate a new index in which each row's index is repeated according to its count, then use iloc to select rows of the original table by this index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
class
0 A
1 C
2 C
This question is very old and the answers do not reflect modern pandas capabilities. You can use iterrows to loop over every row and then use the DataFrame constructor to create new DataFrames with the correct number of rows. Finally, use pd.concat to concatenate all the rows together.
pd.concat([pd.DataFrame(data=[row], index=range(row['count']))
           for _, row in df.iterrows()], ignore_index=True)
class count
0 A 1
1 C 2
2 C 2
This has the benefit of working with any size DataFrame.
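A still more direct sketch on recent pandas (my addition, not from the original answers): repeat each index label by its count and select the rows with loc.
# Repeat each row's index label 'count' times, then select those rows;
# rows with count == 0 simply disappear.
out = (df.loc[df.index.repeat(df['count'])]
         .drop(columns='count')
         .reset_index(drop=True))
print(out)
#   class
# 0     A
# 1     C
# 2     C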
