How to fill an empty df with a for loop - python

I need to create a dataframe with two columns: a variable, and a function of that variable. The following code raises an error:
test = pd.DataFrame({'Column_1': pd.Series([], dtype='int'),
                     'Column_2': pd.Series([], dtype='float')})
for i in range(1, 30):
    k = 0.5**i
    test.append(i, k)
print(test)
TypeError: cannot concatenate object of type '<class 'int'>'; only Series and DataFrame objs are valid
What do I need to fix here? The answer is probably simple, but I haven't been able to find it...
Many thanks for your help

Is there a specific reason you are trying to use a loop? You can create the df with Column_1 and use pandas' vectorized operations to create Column_2:
import numpy as np
df = pd.DataFrame(np.arange(1, 30), columns=['Column_1'])
df['Column_2'] = 0.5**df['Column_1']
Column_1 Column_2
0 1 0.50000
1 2 0.25000
2 3 0.12500
3 4 0.06250
4 5 0.03125

I like Vaishali's way of approaching it. If you really want to use the for loop, this is how I would have done it (see the note after the output for newer pandas versions):
import pandas as pd
test = pd.DataFrame({'Column_1': pd.Series([], dtype='int'),
                     'Column_2': pd.Series([], dtype='float')})
for i in range(1, 30):
    test = test.append({'Column_1': i, 'Column_2': 0.5**i}, ignore_index=True)
test = test.round(5)
print(test)
Column_1 Column_2
0 1.0 0.50000
1 2.0 0.25000
2 3.0 0.12500
3 4.0 0.06250
4 5.0 0.03125
5 6.0 0.01562
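Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the loop above fails on recent versions. A minimal sketch of the same idea for newer pandas (collect the rows in a list, then build the DataFrame once):
import pandas as pd

rows = []
for i in range(1, 30):
    # same per-row computation as above
    rows.append({'Column_1': i, 'Column_2': 0.5**i})
# build the DataFrame in one go instead of appending row by row
test = pd.DataFrame(rows)
print(test.round(5))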


python Pandas lambda apply doesn't work for NaN

I've been trying to do an efficient vlookup-style operation in pandas, with an IF-function twist...
Basically, I want to apply this to the column ccy_grp: if the value in a particular row is 'NaN', it should take the value from another column, ccy:
def func1(tkn1, tkn2):
    if tkn1 == 'NaN':
        return tkn2
    else:
        return tkn1
tmp1_.ccy_grp = tmp1_.apply(lambda x: func1(x.ccy_grp, x.ccy), axis = 1)
but nope, it doesn't work. The code cannot seem to detect 'NaN'. I also tried np.isnan(tkn1), but I just get a boolean error message...
Does any experienced pandas developer know what's wrong?
Use pandas.isna to detect whether a value is NaN.
Generate data:
import pandas as pd
import numpy as np
data = pd.DataFrame({'value': [np.NAN, None, 1, 2, 3],
                     'label': ['str: np.NAN', 'str: None', 'str: 1', 'str: 2', 'str: 3']})
data
Create a function:
def func1(x):
    if pd.isna(x):
        return 'is a na'
    else:
        return f'{x}'
Apply the function to the data:
data['func1_result'] = data['value'].apply(func1)
data
There is a pandas method for what you are trying to do. Check out combine_first:
Update null elements with value in the same location in ‘other’.
Combine two Series objects by filling null values in one Series with
non-null values from the other Series.
tmp1_.ccy_grp = tmp1_.ccy_grp.combine_first(tmp1_.ccy)
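If it helps, here is a minimal self-contained sketch with made-up data (the column names just mirror the question):
import numpy as np
import pandas as pd

tmp1_ = pd.DataFrame({'ccy_grp': [np.nan, 'EUR', np.nan],
                      'ccy': ['USD', 'GBP', 'JPY']})
# fill the missing ccy_grp entries with the corresponding ccy values
tmp1_.ccy_grp = tmp1_.ccy_grp.combine_first(tmp1_.ccy)
print(tmp1_)  # ccy_grp is now USD, EUR, JPY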
This looks like it should be a pandas mask/where/fillna problem, not an apply:
Given:
value values2
0 NaN 0.0
1 NaN 0.0
2 1.0 1.0
3 2.0 2.0
4 3.0 3.0
Doing:
df.value.fillna(df.values2, inplace=True)
print(df)
# or
df.value.mask(df.value.isna(), df.values2, inplace=True)
print(df)
# or
df.value.where(df.value.notna(), df.values2, inplace=True)
print(df)
Output:
value values2
0 0.0 0.0
1 0.0 0.0
2 1.0 1.0
3 2.0 2.0
4 3.0 3.0

`.iloc()` returns strange results when used with dask dataframe groupby

I have a large dataset with 3 columns:
sku center units
0 103896 1 2.0
1 103896 1 0.0
2 103896 1 5.0
3 103896 1 0.0
4 103896 1 7.0
5 103896 1 0.0
And I need to use a groupby-apply.
def function_a(x):
    return np.sum((x > 0).iloc[::-1].cumsum() == 0)

def function_b(x):
    return x.eq(0).sum() / ((x.eq(0) & x.shift().ne(0)).sum())
Using dask (df.groupby(['sku', 'center'])['units'].apply(function_a, meta=(float))), I have many problems applying the first function because dask does not support positional indexing (.iloc), and the results are totally wrong.
Is it possible to apply those functions using a PySpark UDF?
Assumptions
Your index (in the example above: 0, 1, 2, 3, 4, 5) corresponds to the sort order you want, e.g. because the data is stored as CSVs of the form
0,103896,1,2.0
1,103896,1,0.0
2,103896,1,5.0
where the first column corresponds to the sample number. When you then read the data with:
import dask.dataframe as dd
df = dd.read_csv('path/to/data_*.csv', header=None)
df.columns = ['id', 'sku', 'center', 'units']
df = df.set_index('id')
this gives you a deterministic DataFrame, meaning the index of the data is the same no matter in what order the files are read from the drive.
Solution to the .iloc() problem
You can then change function_a to:
def function_a(x):
    return np.sum((x.sort_index(ascending=False) > 0).cumsum() == 0)
which should now work with
df.groupby(['sku', 'center'])['units'].apply(function_a, meta=(float))
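If you want to convince yourself that the rewrite matches the original, a quick check on a small plain-pandas Series (toy data based on the units column from the example):
import numpy as np
import pandas as pd

s = pd.Series([2.0, 0.0, 5.0, 0.0, 7.0, 0.0])

def function_a_iloc(x):
    # original version: counts the trailing run of non-positive values
    return np.sum((x > 0).iloc[::-1].cumsum() == 0)

def function_a_sorted(x):
    # rewrite without positional indexing
    return np.sum((x.sort_index(ascending=False) > 0).cumsum() == 0)

assert function_a_iloc(s) == function_a_sorted(s) == 1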

Conditional counting for group variables

I have the following dataframe:
import pandas as pd
df = pd.DataFrame({"Shop_type": [1, 2, 3, 3, 2, 3, 1, 2, 1],
                   "Self_managed": [True, False, False, True, True, False, False, True, False],
                   "Support_required": [True, True, True, False, False, False, False, False, True]})
My goal is to get an overview of the number of Self_managed shops and Support_required shops per Shop_type, somewhat like this:
Shop_type Self_count Supprt_count
0 1 1 2
1 2 2 1
2 3 1 1
Currently I use the following code to achieve this, but it looks very long and unprofessional. Since I am still learning Python, I would like to improve and have more efficient code. Any ideas?
df1 = df[df["Self_managed"] == True]
df1 = df1.groupby(['Shop_type']).size().reset_index(name='Self_count')
df2 = df[df["Support_required"] == True]
df2 = df2.groupby(['Shop_type']).size().reset_index(name='Supprt_count')
df = df1.merge(df2, how = "outer", on="Shop_type")
Seems like you need
df.groupby('Shop_type',as_index=False).sum()
Out[298]:
Shop_type Self_managed Support_required
0 1 1.0 2.0
1 2 2.0 1.0
2 3 1.0 1.0
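If you also want the exact column names from your desired output, a rename on top of the groupby works (a small sketch; the new names just mirror the question):
out = (df.groupby('Shop_type', as_index=False)[['Self_managed', 'Support_required']]
         .sum()
         .rename(columns={'Self_managed': 'Self_count',
                          'Support_required': 'Supprt_count'}))
print(out)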

Set values based on df.query?

I'd like to set the value of a column based on a query. I could probably use .where to accomplish this, but the criteria for .query are strings which are easier for me to maintain, especially when the criteria become complex.
import numpy as np
import pandas as pd
np.random.seed(51723)
n = 10
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
I'd like to make a new column, d, and set the value to 1 where these criteria are met:
criteria = '(a < b) & (b < c)'
Among other things, I've tried:
df['d'] = np.nan
df.query(criteria).loc[:,'d'] = 1
But that seems to do nothing except give a SettingWithCopyWarning, even though I'm using .loc
And passing inplace like this:
df.query(criteria, inplace=True).loc[:,'d'] = 1
Gives AttributeError: 'NoneType' object has no attribute 'loc'
AFAIK df.query() returns a new DF, so try the following approach:
In [146]: df.loc[df.eval(criteria), 'd'] = 1
In [147]: df
Out[147]:
a b c d
0 0.175155 0.221811 0.808175 1.0
1 0.069033 0.484528 0.841618 1.0
2 0.174685 0.648299 0.904037 1.0
3 0.292404 0.423220 0.897146 1.0
4 0.169869 0.395967 0.590083 1.0
5 0.574394 0.804917 0.746797 NaN
6 0.642173 0.252437 0.847172 NaN
7 0.073629 0.821715 0.859776 1.0
8 0.999789 0.833708 0.230418 NaN
9 0.028163 0.666961 0.582713 NaN
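If you specifically want to keep the string criteria and go through df.query, an equivalent approach (assuming the same df and criteria as above) is to assign via the index of the query result:
df['d'] = np.nan
df.loc[df.query(criteria).index, 'd'] = 1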

Pandas divide one row by another and output to another row in the same dataframe

For a Dataframe such as:
dt
COL000 COL001
STK_ID
Rowname1 2 2
Rowname2 1 4
Rowname3 1 1
What's the easiest way to append to the same dataframe the result of dividing Rowname1 by Rowname2? i.e. the desired outcome is:
COL000 COL001
STK_ID
Rowname1 2 2
Rowname2 1 4
Rowname3 1 1
Newrow 2 0.5
Sorry if this is a simple question, I'm slowly getting to grips with pandas from an R background.
Thanks in advance!!!
The code below will create a new row with index d, formed by dividing row a by row b.
import pandas as pd
df = pd.DataFrame(data={'x':[1,2,3], 'y':[4,5,6]}, index=['a', 'b', 'c'])
df.loc['d'] = df.loc['a'] / df.loc['b']
print(df)
# x y
# a 1.0 4.0
# b 2.0 5.0
# c 3.0 6.0
# d 0.5 0.8
In order to access the first two rows without caring about the index labels, you can use:
df.loc['newrow'] = df.iloc[0] / df.iloc[1]
then just follow @Ffisegydd's solution...
In addition, if you want to append multiple rows, you can use the pd.DataFrame.append function (see the sketch below).
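Note that DataFrame.append was removed in pandas 2.0; on recent versions the same thing is done with pd.concat. A small sketch reusing the df from above (the new values are made up for illustration):
new_rows = pd.DataFrame({'x': [0.5, 3.0], 'y': [0.8, 1.5]}, index=['d', 'e'])
df = pd.concat([df, new_rows])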
pandas does the arithmetic element by element. By assigning the result to a new label, you also tell it you want a new column:
data['new_row_with_division'] = data['row_name1_values'] / data['row_name2_values']
