Unpack a function into a data frame - python

I have a function which returns two lists, so I can save them in two variables like:
list_a,list_b = my_function(input)
I want to save this directly into a dataframe, something like this:
df[['list_a','list_b']] = my_function(input)
I got the following error:
array is not broadcastable to correct shape

Use
df['B'], df['C'] = my_function()
to unpack the tuple of lists returned by my_function and assign the lists to df['B'] and df['C']:
import pandas as pd
N = 5
def my_function():
    return [10]*N, [20]*N
df = pd.DataFrame({'A':[1]*N})
df['B'], df['C'] = my_function()
yields
A B C
0 1 10 20
1 1 10 20
2 1 10 20
3 1 10 20
4 1 10 20
Note that the lengths of the lists returned by my_function must match the length of df.
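To illustrate the note about lengths, here is a minimal sketch of what happens when the returned lists are shorter than df. The function short_function is a hypothetical stand-in, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({"A": [1] * 5})

def short_function():
    # Hypothetical function returning lists shorter than df (3 vs 5 rows)
    return [10] * 3, [20] * 3

try:
    # The first column assignment already fails on the length mismatch
    df["B"], df["C"] = short_function()
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__)
```

pandas raises a ValueError here rather than padding or truncating the lists.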

import pandas as pd
list_a, list_b = my_function(input)
df = pd.DataFrame(list(zip(list_a, list_b)), columns=['a','b'])
or combined into one line:
df = pd.DataFrame(list(zip(*my_function(input))), columns=['a','b'])
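Another way to build the frame directly from the function's return value is to pair column names with the returned lists in a dict. This is only a sketch: my_function below is a hypothetical stand-in that takes a length n, and the column names come from the question.

```python
import pandas as pd

def my_function(n):
    # Hypothetical stand-in: returns two lists of equal length
    return list(range(n)), [x * x for x in range(n)]

names = ["list_a", "list_b"]

# zip pairs each name with one returned list; dict() turns the
# pairs into the {column: values} mapping DataFrame expects
df = pd.DataFrame(dict(zip(names, my_function(4))))
print(df)
```

Each returned list becomes one column, so both lists must have the same length.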

Related

Iterating over lists produces unexpected results

In the first example below, I am iterating over a list of dataframes. The For loop creates column 'c'. Printing each df shows that both elements in the list were updated.
In the second example, I am iterating over a list of variables. The for loop applies some math to each element. But when printing, the list does not reflect the changes made in the for loop.
Please help me to understand why the elements in the second example are not being impacted by the For loop, like they are in the first example.
import pandas as pd
df1 = pd.DataFrame([[1,2],[3,4]], columns=['a', 'b'])
df2 = pd.DataFrame([[3,4],[5,6]], columns=['a', 'b'])
dfs = [df1, df2]
for df in dfs:
    df['c'] = df['a'] + df['b']
print(df1)
print(df2)
result:
a b c
0 1 2 3
1 3 4 7
a b c
0 3 4 7
1 5 6 11
Second example:
a, b = 2, 3
test = [a, b]
for x in test:
    x = x * 2
print(test)
result: [2, 3]
expected result: [4, 6]
In your second example, test is a list of ints which are not mutable. If you want a similar effect to your first snippet, you will have to store something mutable in your list:
a, b = 2, 3
test = [[a], [b]]
for x in test:
    x[0] = x[0] * 2
print(test)
Output: [[4], [6]]
When you iterate over a list like this, x takes the value at the current position.
for x in test:
    x = x * 2
When you try to assign a new value to x you are not changing the element in the list, you are changing what the variable x contains.
To change the actual value in the list iterate by index:
for i in range(len(test)):
    test[i] = test[i] * 2
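The difference between rebinding the loop variable and writing back into the list can be checked side by side. This is a small sketch; the variable names unchanged, in_place, and doubled are just for illustration:

```python
test = [2, 3]

# Rebinding the loop variable leaves the list untouched
for x in test:
    x = x * 2
unchanged = list(test)

# enumerate supplies the index needed to write back into the list
for i, x in enumerate(test):
    test[i] = x * 2
in_place = list(test)

# A list comprehension builds a new list instead of mutating one
doubled = [x * 2 for x in [2, 3]]

print(unchanged, in_place, doubled)
```

The DataFrame loop in the first example works because df['c'] = ... mutates the object that both df and df1 refer to; it never rebinds a name.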

Use previous row value for calculating log

I have a DataFrame, as presented in the spreadsheet below. It has a column A.
https://docs.google.com/spreadsheets/d/1h3ED1FbkxQxyci0ETQio8V4cqaAOC7bIJ5NvVx41jA/edit?usp=sharing
I have been trying to create a new column like A_output which uses the previous row value and the current row value to compute the natural log.
df.apply(custom_function, axis=1) #on a function
But I am not sure how to access the value of the previous row.
The only thing I have tried is converting the values into a list, performing my operation, and assigning the result back to the dataframe, something like this:
output = []
previous_value = 100
for value in df['A'].values:
    output.append(np.log(value/previous_value))
    previous_value = value
df['A_output'] = output
This is going to be an extremely expensive operation. What's the best way to approach this problem?
Another way with rolling():
import pandas as pd
import numpy as np
data = np.random.normal(loc=5., size=(6, 1))
df = pd.DataFrame(columns=['A'], data=data)
# raw=True passes each window as a NumPy array, so x[0] and x[1] are positional
df['output'] = df['A'].rolling(2).apply(lambda x: np.log(x[1]/x[0]), raw=True)
init_val = 3.
df.loc[0, 'output'] = np.log(df.loc[0, 'A'] / init_val) # <-- manually assign value for the first item
print(df)
# A output
# 0 7.257160 0.883376
# 1 4.579390 -0.460423
# 2 4.630148 0.011023
# 3 5.153198 0.107029
# 4 6.004917 0.152961
# 5 6.633857 0.099608
If you want to apply the same operation on multiple columns:
import pandas as pd
import numpy as np
data = np.random.normal(loc=5., size=(6, 2))
df = pd.DataFrame(columns=['A', 'B'], data=data)
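# raw=True passes each window as a NumPy array, so x[0] and x[1] are positional
df[['output_A', 'output_B']] = df.rolling(2).apply(lambda x: np.log(x[1]/x[0]), raw=True)
init_val = 3.
df.loc[0, 'output_A'] = np.log(df.loc[0, 'A'] / init_val)
df.loc[0, 'output_B'] = np.log(df.loc[0, 'B'] / init_val)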
print(df)
# A B output_A output_B
# 0 7.289657 4.986245 0.887844 0.508071
# 1 5.690721 5.010605 -0.247620 0.004874
# 2 5.773812 5.129814 0.014495 0.023513
# 3 4.417981 6.395500 -0.267650 0.220525
# 4 4.923170 5.363723 0.108270 -0.175936
# 5 5.279008 5.327365 0.069786 -0.006802
We can use Series.shift and then use .loc to compute the first value from the base value.
Let's assume we have the following dataframe:
df = pd.DataFrame({'A':np.random.randint(1, 10, 5)})
print(df)
A
0 8
1 3
2 3
3 1
4 5
df['A_output'] = np.log(df['A'] / df['A'].shift())
df.loc[0, 'A_output'] = np.log(df.loc[0, 'A'] / 100)
print(df)
A A_output
0 8 -2.525729
1 3 -0.980829
2 3 0.000000
3 1 -1.098612
4 5 1.609438
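The shift approach can also absorb the base value in one step, since Series.shift accepts a fill_value argument. The sketch below reuses the sample column from the answer above and checks it against the loop from the question; the names base_value, previous, and expected are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [8, 3, 3, 1, 5]})
base_value = 100  # starting "previous" value, as in the question's loop

# fill_value supplies the "previous" value for the first row,
# so no separate .loc fix-up is needed
previous = df["A"].shift(fill_value=base_value)
df["A_output"] = np.log(df["A"] / previous)

# Reference result computed with the explicit loop from the question
expected = []
prev = base_value
for value in df["A"]:
    expected.append(np.log(value / prev))
    prev = value

print(df)
```

Both versions give the same numbers; the vectorized one avoids the Python-level loop.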

Need to calculate columns from CSV using pandas

incidentcountlevel1 and examcount are two column names in a CSV file. I want to calculate two columns based on them. I have written the script below, but it's failing:
import pandas as pd
import numpy as np
import time, os, fnmatch, shutil
df = pd.read_csv(r"/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',skiprows=[1])
df1 = pd.read_csv(r"/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',skiprows=[1])
df3 = pd.read_csv("/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',converters={"incidentcountlevel1":int})
inc_count_lvl_1 = df3.loc[:, ['incidentcountlevel1']]
exam_count=df3.loc[:, ['examcount']]
for exam_count in exam_count: #need to iterate this col to calculate for each row
    if exam_count < 1:
        print "IPTE Cannot be calculated"
    else:
        if inc_count_lvl_1 > 5:
            ipte1= (inc_count_lvl_1/exam_count)*1000
        else:
            dof = 2*(inc_count_lvl_1+ 1)
            chi_square=chi2.ppf(0.5,dof)
            ipte1=(chi_square/(2*exam_count))*1000
You can apply a lambda function to a pandas column.
Here is an example using numpy. You can adapt it to your case:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 50]})
>>> df['new_column'] = np.multiply(df['A'], df['B'])
>>> df
A B new_column
0 10 20 200
1 20 30 600
2 30 50 1500
or you can create your own function:
>>> def fx(x, y):
...     return x*y
...
>>> df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
>>> df
A B new_column
0 10 20 200
1 20 30 600
2 30 50 1500
In your case, the solution might look like this:
from scipy.stats import chi2

def fx(exam_count, inc_count_lvl_1):
    if exam_count < 1:
        return -1  # whatever you want
    if inc_count_lvl_1 > 5:
        ipte1 = (inc_count_lvl_1/exam_count)*1000
    else:
        dof = 2*(inc_count_lvl_1 + 1)
        chi_square = chi2.ppf(0.5, dof)
        ipte1 = (chi_square/(2*exam_count))*1000
    return ipte1

# otypes=[float] keeps results as floats even if the first row returns -1
df['new_column'] = np.vectorize(fx, otypes=[float])(df['examcount'], df['incidentcountlevel1'])
If you don't want to use lambda functions, you can use iterrows.
iterrows is a generator which yields both the index and the row.
for index, row in df.iterrows():
    print(row['examcount'], row['incidentcountlevel1'])
    # do your stuff.
I hope it helps.
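For the simple-ratio part of the calculation, a boolean mask avoids both np.vectorize and iterrows. This is only a sketch: the sample rows are made up, NaN is used here as the "cannot be calculated" marker instead of -1, and the chi-square branch from the question is left out entirely.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names follow the question
df = pd.DataFrame({
    "examcount": [0, 2000, 500],
    "incidentcountlevel1": [3, 10, 8],
})

# Rows where the rate is defined (at least one exam)
valid = df["examcount"] >= 1

# Start with NaN everywhere, then fill only the valid rows
df["ipte1"] = np.nan
df.loc[valid, "ipte1"] = (
    df.loc[valid, "incidentcountlevel1"] / df.loc[valid, "examcount"] * 1000
)

print(df)
```

A second mask on incidentcountlevel1 could route the remaining rows to the chi-square formula in the same way.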

Appending a list or series to a pandas DataFrame as a row?

So I have initialized an empty pandas DataFrame and I would like to iteratively append lists (or Series) as rows in this DataFrame. What is the best way of doing this?
df = pd.DataFrame(columns=list("ABC"))
df.loc[len(df)] = [1,2,3]
Sometimes it's easier to do all the appending outside of pandas, then just create the DataFrame in one shot.
>>> import pandas as pd
>>> simple_list=[['a','b']]
>>> simple_list.append(['e','f'])
>>> df=pd.DataFrame(simple_list,columns=['col1','col2'])
>>> df
col1 col2
0 a b
1 e f
Here's a simple and dumb solution:
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df = df.append({'foo':1, 'bar':2}, ignore_index=True)
Could you do something like this?
>>> import pandas as pd
>>> df = pd.DataFrame(columns=['col1', 'col2'])
>>> df = df.append(pd.Series(['a', 'b'], index=['col1','col2']), ignore_index=True)
>>> df = df.append(pd.Series(['d', 'e'], index=['col1','col2']), ignore_index=True)
>>> df
col1 col2
0 a b
1 d e
Does anyone have a more elegant solution?
Following onto Mike Chirico's answer... if you want to append a list after the dataframe is already populated...
>>> rows = [['f','g']]
>>> df = df.append(pd.DataFrame(rows, columns=['col1','col2']),ignore_index=True)
>>> df
col1 col2
0 a b
1 d e
2 f g
There are several ways to append a list to a Pandas Dataframe in Python. Let's consider the following dataframe and list:
import pandas as pd
# Dataframe
df = pd.DataFrame([[1, 2], [3, 4]], columns = ["col1", "col2"])
# List to append
new_row = [5, 6]
Option 1: append the list at the end of the dataframe with pandas.DataFrame.loc.
df.loc[len(df)] = new_row
Option 2: convert the list to dataframe and append with pandas.DataFrame.append().
df = df.append(pd.DataFrame([new_row], columns=df.columns), ignore_index=True)
Option 3: convert the list to series and append with pandas.DataFrame.append().
df = df.append(pd.Series(new_row, index = df.columns), ignore_index=True)
Each of the above options should output something like:
>>> print (df)
col1 col2
0 1 2
1 3 4
2 5 6
Reference : How to append a list as a row to a Pandas DataFrame in Python?
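DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so Options 2 and 3 fail on current releases. A minimal sketch of the same idea with pd.concat, reusing the dataframe from this answer (the names new_row and row_df are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["col1", "col2"])
new_row = [5, 6]

# Wrap the list in a one-row DataFrame with matching columns,
# then concatenate; ignore_index renumbers the rows 0..n-1
row_df = pd.DataFrame([new_row], columns=df.columns)
df = pd.concat([df, row_df], ignore_index=True)

print(df)
```

When many rows are added in a loop, it is usually cheaper to collect them in a Python list and call pd.concat (or the DataFrame constructor) once at the end.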
Converting the list to a data frame within the append function also works, including when applied in a loop:
import pandas as pd
mylist = [1,2,3]
df = pd.DataFrame()
df = df.append(pd.DataFrame(data=[mylist]))
Here's a function that, given an already created dataframe, will append a list as a new row. This should probably have error catchers thrown in, but if you know exactly what you're adding then it shouldn't be an issue.
import pandas as pd
import numpy as np
def addRow(df,ls):
    """
    Given a dataframe and a list, append the list as a new row to the dataframe.
    :param df: <DataFrame> The original dataframe
    :param ls: <list> The new row to be added
    :return: <DataFrame> The dataframe with the newly appended row
    """
    numEl = len(ls)
    newRow = pd.DataFrame(np.array(ls).reshape(1,numEl), columns = list(df.columns))
    df = df.append(newRow, ignore_index=True)
    return df
If you want to add a Series and use the Series' index as columns of the DataFrame, you only need to append the Series between brackets:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame()
In [3]: row=pd.Series([1,2,3],["A","B","C"])
In [4]: row
Out[4]:
A 1
B 2
C 3
dtype: int64
In [5]: df.append([row],ignore_index=True)
Out[5]:
A B C
0 1 2 3
[1 rows x 3 columns]
Without ignore_index=True you don't get a proper index.
Simply use loc:
>>> df
A B C
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
A B C
one 1 2 3
two 4 5 6
As mentioned here - https://kite.com/python/answers/how-to-append-a-list-as-a-row-to-a-pandas-dataframe-in-python, you'll need to first convert the list to a series then append the series to dataframe.
df = pd.DataFrame([[1, 2], [3, 4]], columns = ["a", "b"])
to_append = [5, 6]
a_series = pd.Series(to_append, index = df.columns)
df = df.append(a_series, ignore_index=True)
Consider a DataFrame A of shape N x 2. To add one more row, use the following.
A.loc[A.shape[0]] = [3,4]
The simplest way:
my_list = [1,2,3,4,5]
df['new_column'] = pd.Series(my_list).values
Edit:
Don't forget that the length of the new list must be the same as the length of the corresponding DataFrame.

Create a Pandas DataFrame from series without duplicating their names?

Is it possible to create a DataFrame from a list of series without duplicating their names?
Ex, creating the same DataFrame as:
>>> pd.DataFrame({ "foo": data["foo"], "bar": other_data["bar"] })
But without needing to explicitly name the columns?
Try pandas.concat which takes a list of items to combine as its argument:
df1 = pd.DataFrame(np.random.randn(100, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.random.randn(100, 3), columns=list('xyz'))
df3 = pd.concat([df1['a'], df2['y']], axis=1)
Note that you need to use axis=1 to stack things together side by side and axis=0 (which is the default) to combine them one over the other.
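The reason concat answers the question is that a column selected from a DataFrame is a Series that carries its label in .name, and concat along axis=1 reuses those names as headers. A small sketch with made-up data (data and other_data mirror the names in the question):

```python
import pandas as pd

data = pd.DataFrame({"foo": [1, 2, 3], "baz": [7, 8, 9]})
other_data = pd.DataFrame({"bar": [4, 5, 6]})

# Each selected column is a Series whose .name is the column label,
# so no column name has to be typed a second time
df = pd.concat([data["foo"], other_data["bar"]], axis=1)

print(df)
```

Rows are matched on the index, so Series with different indexes produce NaN where one side has no value.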
Seems like you want to join the dataframes (works similar to SQL):
import numpy as np
import pandas
df1 = pandas.DataFrame(
    np.random.randint(low=0, high=11, size=(10,2)),  # high is exclusive, so values are 0..10
    columns = ['foo', 'bar'],
    index=list('ABCDEFHIJK')
)
df2 = pandas.DataFrame(
    np.random.randint(low=0, high=11, size=(10,2)),
    columns = ['bar', 'bax'],
    index=list('DEFHIJKLMN')
)
df1[['foo']].join(df2['bar'], how='outer')
The on kwarg takes a list of columns or None. If None, it'll join on the indices of the two dataframes. You just need to make sure that you're using a dataframe for the left side, hence the double brackets to force df1[['foo']] to a dataframe (df1['foo'] returns a series).
This gives me:
foo bar
A 4 NaN
B 0 NaN
C 10 NaN
D 8 3
E 2 0
F 3 3
H 9 10
I 0 9
J 5 6
K 2 9
L NaN 3
M NaN 1
N NaN 1
You can also do inner, left, and right joins.
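To make the join types concrete, here is a small sketch with fixed values instead of random ones. The frames and their indexes are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"foo": [4, 0, 8]}, index=list("ABD"))
df2 = pd.DataFrame({"bar": [3, 0, 1]}, index=list("DEL"))

# outer keeps every index label from either side
outer = df1[["foo"]].join(df2["bar"], how="outer")
# inner keeps only labels present on both sides
inner = df1[["foo"]].join(df2["bar"], how="inner")
# left keeps the labels of df1 and fills missing bar values with NaN
left = df1[["foo"]].join(df2["bar"], how="left")

print(outer, inner, left, sep="\n\n")
```

Only the label D exists in both frames, so the inner join has a single row.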
I prefer the explicit way, as presented in your original post, but if you really want to write certain names once, you could try this:
import pandas as pd
import numpy as np
def dictify(*args):
    return dict((i,n[i]) for i,n in args)
data = { 'foo': np.random.randn(5) }
other_data = { 'bar': np.random.randn(5) }
print(pd.DataFrame(dictify(('foo', data), ('bar', other_data))))
The output is as expected:
bar foo
0 0.533973 -0.477521
1 0.027354 0.974038
2 -0.725991 0.350420
3 1.921215 0.648210
4 0.547640 1.652310
[5 rows x 2 columns]
