I am fairly new to Python and trying to figure out how to generate dataframes for multiple arrays. I have a list where the arrays are currently stored:
list = [ [1 2 3 4], [12 19 30 60 95 102] ]
What I want to do is take each array from this list and put them into separate dataframes, with the array contents populating a column of the dataframe like so:
Array2_df
1 12
2 19
3 30
4 60
I have found several answers involving the use of dictionaries, but am not sure how that would actually solve my problem... I also don't understand how naming the dataframes dynamically would work. I have tried playing around with for loops, but that just overwrote the same dataframe repeatedly. Please help!! Thanks :)
As mentioned in the comments, dynamically creating variables is a bad idea. Why not use a single dataframe, like so:
In [1]: zlist = [[1, 2, 3, 4], [12, 19, 30, 60, 95, 102], [1, 2, 4, 5, 1, 6, 1, 7, 8, 21]]
In [2]: pd.DataFrame({f"array_{i}": pd.Series(z) for i, z in enumerate(zlist)})
Out[2]:
array_0 array_1 array_2
0 1.0 12.0 1
1 2.0 19.0 2
2 3.0 30.0 4
3 4.0 60.0 5
4 NaN 95.0 1
5 NaN 102.0 6
6 NaN NaN 1
7 NaN NaN 7
8 NaN NaN 8
9 NaN NaN 21
If you really insist on separate dataframes, then you should store them in a dictionary:
df_dict = {f"array_{i}": pd.DataFrame({f"array_{i}": z}) for i, z in enumerate(zlist)}
Then, you can access a specific dataframe by name:
In [8]: df_dict["array_2"]
Out[8]:
array_2
0 1
1 2
2 4
3 5
4 1
5 6
6 1
7 7
8 8
9 21
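To make the dictionary approach more concrete, here is a minimal sketch that builds one dataframe per array and then loops over them by name. It reuses the two lists from the question; the loop variable names `name` and `frame` are only illustrative.

```python
import pandas as pd

zlist = [[1, 2, 3, 4], [12, 19, 30, 60, 95, 102]]

# One single-column dataframe per array, keyed by a generated name
df_dict = {
    f"array_{i}": pd.DataFrame({f"array_{i}": z})
    for i, z in enumerate(zlist)
}

# The dictionary keys play the role of the "dynamic variable names"
for name, frame in df_dict.items():
    print(name, frame.shape)
```

Because the names are ordinary dictionary keys, nothing is overwritten inside the loop, and each dataframe keeps its own length.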
I am searching for an efficient way to set a new column based on values from previous rows from different columns. Imagine you have this DataFrame:
pd.DataFrame([[0, 22], [1, 15], [2, 18], [3, 9], [4, 10], [6, 11], [8, 12]],
columns=['days', 'quantity'])
days quantity
0 0 22
1 1 15
2 2 18
3 3 9
4 4 10
5 6 11
6 8 12
Now, I want to have a third column 'quantity_3days_ago', like this:
days quantity quantity_3days_ago
0 0 22 NaN
1 1 15 NaN
2 2 18 NaN
3 3 9 22
4 4 10 15
5 6 11 9
6 8 12 10
So I need to use the 'days' column to check what the 'quantity' column says for 3 days ago. In case there is no exact value in the 'days' column I want 'quantity_3days_ago' to be the value of the row before. See the last row as an example: 8 - 3 would be 5 in which case I would take the 'quantity' value of the row with days equals 4 for the 'quantity_3days_ago'. I hope this is understandable. I tried using rolling windows and shifting, but I wasn't able to get the desired result. I guess it would probably be possible with a loop over the whole DataFrame. However, this would be rather inefficient. I wonder if this can be done in one line. Thanks for your help!
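We can reindex onto the full range of days, forward-filling the gaps, and then shift by 3 before mapping the result back onto the days column: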
rng = range(df.days.iloc[0],df.days.iloc[-1]+1)
df['new'] = df.days.map(df.set_index('days').reindex(rng ,method='ffill')['quantity'].shift(3))
df
Out[125]:
days quantity new
0 0 22 NaN
1 1 15 NaN
2 2 18 NaN
3 3 9 22.0
4 4 10 15.0
5 6 11 9.0
6 8 12 10.0
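As an alternative sketch (not the method used in the answer above), the same "value from 3 days ago, or the nearest earlier row" lookup can be expressed with `pd.merge_asof`. The helper frame `lookup` is a name introduced here for illustration, and the example assumes the frame is sorted by `days`, as it is in the question.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 22], [1, 15], [2, 18], [3, 9], [4, 10], [6, 11], [8, 12]],
    columns=['days', 'quantity'],
)

# Move every observation 3 days forward, so that "day d" in this lookup
# table holds the quantity that was recorded on day d - 3.
lookup = (
    df.assign(days=df['days'] + 3)
      .rename(columns={'quantity': 'quantity_3days_ago'})
)

# For each day, take the latest lookup row whose day is <= that day
# (direction='backward'). Both frames must be sorted by 'days'.
result = pd.merge_asof(df, lookup, on='days', direction='backward')
print(result)
```

This avoids building the full day range, which may matter if the `days` values are sparse.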
Consider this simple example
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9,1,3,5,7,4,5,6,4,7,8,9],
'b':[3,5,6,2,4,6,2,5,7,1,9,5,3,2,5,4,3,6,4,1,1,9]})
I am trying to perform a rolling regression of a on b. I am trying to use the simplest pandas tool available: apply. I want to use apply because I want to keep the flexibility of returning any parameter of the regression.
However, the simple code below does not work
df.rolling(10).apply(lambda x: smf.ols('a ~ b', data = x).fit())
File "<string>", line 1, in <module>
PatsyError: Error evaluating factor: NameError: name 'b' is not defined
a ~ b
^
What is the issue?
Thanks!
Rolling.apply cannot operate on multiple columns at once, nor can it return non-numeric values. Instead, we can take advantage of the fact that rolling objects are iterable. We also need to handle min_periods ourselves, since iterating over a rolling object yields every window, including the shorter ones at the start, regardless of the other rolling arguments.
We can then write a function that turns each window's regression results into one row of output, something like:
def process(x):
    if len(x) >= 10:
        reg = smf.ols('a ~ b', data=x).fit()
        print(reg.params)
        return [
            # b from params
            reg.params['b'],
            # b from tvalues
            reg.tvalues['b'],
            # Both lower and upper b from conf_int()
            *reg.conf_int().loc['b', :].tolist()
        ]
    # Return NaN in the same dimension as the results
    return [np.nan] * 4
df = df.join(
    # join new DataFrame back to original
    pd.DataFrame(
        (process(x) for x in df.rolling(10)),
        columns=['coef', 't', 'lower', 'upper']
    )
)
df:
a b coef t lower upper
0 1 3 NaN NaN NaN NaN
1 3 5 NaN NaN NaN NaN
2 5 6 NaN NaN NaN NaN
3 7 2 NaN NaN NaN NaN
4 4 4 NaN NaN NaN NaN
5 5 6 NaN NaN NaN NaN
6 6 2 NaN NaN NaN NaN
7 4 5 NaN NaN NaN NaN
8 7 7 NaN NaN NaN NaN
9 8 1 -0.216802 -0.602168 -1.047047 0.613442
10 9 9 0.042781 0.156592 -0.587217 0.672778
11 1 5 0.032086 0.097763 -0.724742 0.788913
12 3 3 0.113475 0.329006 -0.681872 0.908822
13 5 2 0.198582 0.600297 -0.564258 0.961421
14 7 5 0.203540 0.611002 -0.564646 0.971726
15 4 4 0.236599 0.686744 -0.557872 1.031069
16 5 3 0.293651 0.835945 -0.516403 1.103704
17 6 6 0.314286 0.936382 -0.459698 1.088269
18 4 4 0.276316 0.760812 -0.561191 1.113823
19 7 1 0.346491 1.028220 -0.430590 1.123572
20 8 1 -0.492424 -1.234601 -1.412181 0.427332
21 9 9 0.235075 0.879433 -0.381326 0.851476
Setup:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
df = pd.DataFrame({
'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9]
})
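The window-iteration pattern can be tried without statsmodels. The sketch below is a simplification, not the answer's code: it swaps in `np.polyfit` to get only the slope of `a` on `b`, and the names `window` and `slope` are introduced here for illustration. It assumes pandas 1.1 or later, where rolling objects can be iterated.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
    'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9],
})

window = 10

def slope(win):
    # Iterating over a rolling object also yields the short leading windows,
    # so the min_periods check has to be done by hand.
    if len(win) < window:
        return np.nan
    # For degree 1, np.polyfit returns [slope, intercept]
    return np.polyfit(win['b'], win['a'], 1)[0]

# Each item yielded by df.rolling(window) is a DataFrame with both columns
df['coef'] = [slope(win) for win in df.rolling(window)]
print(df)
```

Each window arrives as a full DataFrame, which is what makes multi-column calculations possible here and not with `apply`.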
Rolling.apply applies the rolling operation to each column separately (Related question).
Following user3226167's answer in this thread, it seems that the easiest way to accomplish what you want is to use RollingOLS.from_formula from statsmodels.regression.rolling.
from statsmodels.regression.rolling import RollingOLS
df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9,1,3,5,7,4,5,6,4,7,8,9],
'b':[3,5,6,2,4,6,2,5,7,1,9,5,3,2,5,4,3,6,4,1,1,9]})
model = RollingOLS.from_formula('a ~ b', data = df, window=10)
reg_obj = model.fit()
# estimated coefficient
b_coeff = reg_obj.params['b'].rename('coef')
# b t-value
b_t_val = reg_obj.tvalues['b'].rename('t')
# 95 % confidence interval of b
b_conf_int = reg_obj.conf_int(cols=[1]).droplevel(level=0, axis=1)
# join all the desired information to the original df
df = df.join([b_coeff, b_t_val, b_conf_int])
Here reg_obj is a RollingRegressionResults object, which holds a lot of information about the regression (see all its attributes in the docs).
Output
>>> type(reg_obj)
<class 'statsmodels.regression.rolling.RollingRegressionResults'>
>>> df
a b coef t lower upper
0 1 3 NaN NaN NaN NaN
1 3 5 NaN NaN NaN NaN
2 5 6 NaN NaN NaN NaN
3 7 2 NaN NaN NaN NaN
4 4 4 NaN NaN NaN NaN
5 5 6 NaN NaN NaN NaN
6 6 2 NaN NaN NaN NaN
7 4 5 NaN NaN NaN NaN
8 7 7 NaN NaN NaN NaN
9 8 1 -0.216802 -0.602168 -0.922460 0.488856
10 9 9 0.042781 0.156592 -0.492679 0.578240
11 1 5 0.032086 0.097763 -0.611172 0.675343
12 3 3 0.113475 0.329006 -0.562521 0.789472
13 5 2 0.198582 0.600297 -0.449786 0.846949
14 7 5 0.203540 0.611002 -0.449372 0.856452
15 4 4 0.236599 0.686744 -0.438653 0.911851
16 5 3 0.293651 0.835945 -0.394846 0.982147
17 6 6 0.314286 0.936382 -0.343553 0.972125
18 4 4 0.276316 0.760812 -0.435514 0.988146
19 7 1 0.346491 1.028220 -0.313981 1.006963
20 8 1 -0.492424 -1.234601 -1.274162 0.289313
21 9 9 0.235075 0.879433 -0.288829 0.758978
I'm chunking a list into smaller lists of size n and trying to add each new list to a DataFrame. When I print the lists, all of the data is there; when I try to put the lists in a DataFrame, the first list of the set disappears.
my_list = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
n = 5  # chunk size, as implied by the printed chunks
def divide_chunks(a, n):
    for i in range(0, len(a), n):
        yield a[i:i+n]
x = divide_chunks(my_list, n)
for i in x:
    print(i)
gives me
[1, 2, 3, 4, 5]
[6, 7, 8, 9, 10]
[11, 12, 13, 14, 15]
[16, 17, 18, 19, 20]
I would like to put this into a DataFrame.
Here is how I'm trying to do that:
x = divide_chunks(my_list, n)
for i in x:
    emptydf = pd.DataFrame(x)
emptydf
I would expect the output to be like above but instead I'm missing the list that has 1:5
    0   1   2   3   4
0   6   7   8   9  10
1  11  12  13  14  15
2  16  17  18  19  20
Your code is not doing what you think it does:
x = divide_chunks(my_list, 4)
print(x)
This prints a generator object rather than a list, for example:
<generator object divide_chunks at 0x2aaae0622e60>
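A generator can only be consumed once. In your loop, the for statement takes the first chunk, and pd.DataFrame(x) then exhausts the rest, which is why the first chunk is missing. Instead, pass the generator directly to pd.DataFrame, without a loop: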
pd.DataFrame(x)
0 1 2 3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
4 17 18 19 20
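The one-pass behaviour of generators can be reproduced in a few lines. This is a small sketch using the question's `divide_chunks` with a chunk size of 5; the names `gen`, `lost` and `fixed` are introduced here for illustration.

```python
import pandas as pd

def divide_chunks(a, n):
    for i in range(0, len(a), n):
        yield a[i:i + n]

my_list = list(range(1, 21))

# Pattern from the question: the for loop pulls the first chunk out of the
# generator, then pd.DataFrame(gen) consumes everything that is left.
gen = divide_chunks(my_list, 5)
for chunk in gen:
    lost = pd.DataFrame(gen)

# Passing a fresh generator straight to the constructor keeps every chunk.
fixed = pd.DataFrame(divide_chunks(my_list, 5))

print(chunk)        # the chunk that the loop itself consumed
print(lost.shape)
print(fixed.shape)
```

The loop body only runs once, because the generator is already empty when the for statement asks for a second chunk.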
This can also be done with np.array_split. Here I added an extra value to show how it behaves when the list does not divide evenly:
import pandas as pd
import numpy as np
my_list = [*range(1, 22)]
N = 5
pd.DataFrame(np.array_split(my_list, range(N, len(my_list), N)))
# 0 1 2 3 4
#0 1 2.0 3.0 4.0 5.0
#1 6 7.0 8.0 9.0 10.0
#2 11 12.0 13.0 14.0 15.0
#3 16 17.0 18.0 19.0 20.0
#4 21 NaN NaN NaN NaN
I'm attempting to populate a column in a data frame based on whether the index value of that record falls within a range defined by two columns in another data frame.
df1 looks like:
a
0 4
1 45
2 7
3 5
4 48
5 44
6 22
7 89
8 45
9 44
10 23
and df2 is:
START STOP CLASS
0 2 3 1
1 5 7 2
2 8 8 3
what I want would look like:
a CLASS
0 4 nan
1 45 nan
2 7 1
3 5 1
4 48 nan
5 44 2
6 22 2
7 89 2
8 45 3
9 44 nan
10 23 nan
The START column in df2 is the minimum value of the range and the STOP column is the max.
You can use IntervalIndex (requires pandas 0.20.0 or later).
First construct the index:
df2.index = pd.IntervalIndex.from_arrays(df2['START'], df2['STOP'], closed='both')
df2
Out:
START STOP CLASS
[2, 3] 2 3 1
[5, 7] 5 7 2
[8, 8] 8 8 3
Now if you index into the second DataFrame, it will look up the value in the intervals. For example,
df2.loc[6]
Out:
START 5
STOP 7
CLASS 2
Name: [5, 7], dtype: int64
returns the second class. I don't know if it can be used with merge or with merge_asof but as an alternative you can use map:
df1['CLASS'] = df1.index.to_series().map(df2['CLASS'])
Note that I first converted the index to a Series to be able to use the Series.map method. This results in
df1
Out:
a CLASS
0 4 NaN
1 45 NaN
2 7 1.0
3 5 1.0
4 48 NaN
5 44 2.0
6 22 2.0
7 89 2.0
8 45 3.0
9 44 NaN
10 23 NaN
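The same interval lookup can also be sketched with `IntervalIndex.get_indexer`, which returns the position of the interval containing each value, or -1 when no interval contains it. This is an alternative to `map`, not part of the answer above; the names `intervals`, `positions` and `classes` are illustrative, and the sketch assumes the intervals do not overlap.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [4, 45, 7, 5, 48, 44, 22, 89, 45, 44, 23]})
df2 = pd.DataFrame({'START': [2, 5, 8], 'STOP': [3, 7, 8], 'CLASS': [1, 2, 3]})

# Closed on both sides, because START and STOP are both inclusive
intervals = pd.IntervalIndex.from_arrays(df2['START'], df2['STOP'], closed='both')

# Position of the matching interval for every index value of df1 (-1 = no match)
positions = intervals.get_indexer(df1.index.to_numpy())

# Pick the class at each position, and blank out the rows without a match
classes = df2['CLASS'].to_numpy()
df1['CLASS'] = np.where(positions >= 0, classes[positions], np.nan)
print(df1)
```

Working with positions keeps `df2` unchanged, which can be convenient if its original index is still needed elsewhere.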
Alternative solution:
classdict = df2.set_index("CLASS").to_dict("index")
rangedict = {}
for key, value in classdict.items():
    # assign the class (the key) to every index value in its range
    for item in range(value["START"], value["STOP"] + 1):
        rangedict[item] = key
The resulting rangedict:
{2: 1, 3: 1, 5: 2, 6: 2, 7: 2, 8: 3}
Now map, and optionally format the result:
df1['CLASS'] = df1.index.to_series().map(rangedict)
df1.applymap("{0:.0f}".format)
outputs:
a CLASS
0 4 nan
1 45 nan
2 7 1
3 5 1
4 48 nan
5 44 2
6 22 2
7 89 2
8 45 3
9 44 nan
10 23 nan
import pandas as pd
import numpy as np
# Here is your existing dataframe
df_existing = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
# Create a new empty dataframe with specific column names and data types
df_new = pd.DataFrame(index=None)
columns = ['field01','field02','field03','field04']
dtypes = [str,int,int,int]
for c, d in zip(columns, dtypes):
    df_new[c] = pd.Series(dtype=d)
# Set the index on the new dataframe to same as existing
df_new['new_index'] = df_existing.index
df_new.set_index('new_index', inplace=True)
# Fill the new dataframe with specific fields from the existing dataframe
df_new[['field02','field03']] = df_existing[['B','C']]
print(df_new)
I am struggling to get the right index (restricted to the selection) when using the pandas method xs to select specific data in my dataframe. Let me demonstrate what I am doing:
print(df)
value
idx1 idx2 idx3 idx4 idx5
10 2.0 0.0010 1 2 6.0 ...
2 3 6.0 ...
...
7 8 6.0 ...
8 9 6.0 ...
20 2.0 0.0010 1 2 6.0 ...
2 3 6.0 ...
...
18 19 6.0 ...
19 20 6.0 ...
# get dataframe for idx1 = 10, idx2 = 2.0, idx3 = 0.0010
print(df.xs([10,2.0,0.0010]))
value
idx4 idx5
1 2 6.0 ...
2 3 6.0 ...
3 4 6.0 ...
4 5 6.0 ...
5 6 6.0 ...
6 7 6.0 ...
7 8 6.0 ...
8 9 6.0 ...
# get the first index list of this part of the dataframe
print(df.xs([10,2.0,0.0010]).index.levels[0])
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,18, 19]
So I do not understand why the full list of values that occur in idx4 is returned, even though we restricted the dataframe to a part where idx4 only takes values from 1 to 8. Am I using the index attribute in the wrong way?
This is a known feature, not a bug: pandas preserves all of the index information. You can determine which of the levels are expressed, and at what location, via the codes attribute (named labels before pandas 0.24).
If you are looking to create an index that is fresh and just contains the information relevant to the slice you just made, you can do this:
df_new = df.xs([10,2.0,0.0010])
idx_new = pd.MultiIndex.from_tuples(df_new.index.to_series(),
names=df_new.index.names)
df_new.index = idx_new
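In pandas 0.20 and later, `MultiIndex.remove_unused_levels` offers a shorter route to the same result. The sketch below uses a small made-up frame rather than the asker's data (the index values and the `part` name are illustrative), so treat it as a demonstration of the call, not a drop-in replacement.

```python
import pandas as pd

# Small stand-in for the question's frame: idx4 runs further for idx1 == 20
idx = pd.MultiIndex.from_tuples(
    [(10, 1, 2), (10, 2, 3), (20, 1, 2), (20, 2, 3), (20, 3, 4)],
    names=['idx1', 'idx4', 'idx5'],
)
df = pd.DataFrame({'value': [6.0] * 5}, index=idx)

# Select the idx1 == 10 part; the levels may still list labels
# that only occur in other parts of the full frame.
part = df.xs(10, level='idx1').copy()
print(list(part.index.levels[0]))

# Rebuild the levels so they only contain labels that are actually used
part.index = part.index.remove_unused_levels()
print(list(part.index.levels[0]))
```

The row labels themselves are unchanged by this call; only the stored levels are trimmed.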