I'm trying to create a set of dataframes from one big dataframe. These dataframes consist of the columns of the original dataframe in this manner:
1st dataframe is the 1st column of the original one,
2nd dataframe is the 1st and 2nd columns of the original one,
and so on.
I use this code to iterate over the dataframe:
for i, data in enumerate(x):
    data = x.iloc[:, :i]
    print(data)
This works but I also get an empty dataframe in the beginning and an index vector I don't need.
Any suggestions on how to remove those two?
Thanks.
Since you are not using the values yielded by enumerate, only the index, you can instead iterate over the range from 1 to the number of columns plus one, taking the slice df.iloc[:, :i] for each value of i. A list comprehension achieves this:
>>> [df.iloc[:, :i] for i in range(1, df.shape[1] + 1)]
[   A
 0  1
 1  2
 2  3,
    A  B
 0  1  2
 1  2  4
 2  3  6]
The equivalent traditional loop would look something like this:
for i in range(1, df.shape[1] + 1):
    print(df.iloc[:, :i])
   A
0  1
1  2
2  3
   A  B
0  1  2
1  2  4
2  3  6
You can also build a dict of dataframes keyed by column name:
import numpy as np
import pandas as pd

data = {
    'col_1': np.random.randint(0, 10, 5),
    'col_2': np.random.randint(10, 20, 5),
    'col_3': np.random.randint(0, 10, 5),
    'col_4': np.random.randint(10, 20, 5),
}
df = pd.DataFrame(data)
all_df = {col: df.iloc[:, :i] for i, col in enumerate(df, start=1)}

# For example, we can print the last one
print(all_df['col_4'])
   col_1  col_2  col_3  col_4
0      1     13      5     10
1      8     16      1     18
2      6     11      5     18
3      3     11      1     10
4      7     14      8     12
My data is like this:
df = pd.DataFrame({'a': [5,0,0, 6, 0, 0, 0 , 12]})
I want to count the zeros above the 6 and replace them, together with the 6 itself, with 6/(count+1) = 6/3 = 2.
I also want to do a similar thing with the zeros above the 12.
So, 12/(count+1) = 12/4 = 3.
So the final result will be:
[5, 2, 2, 2, 3, 3, 3, 3]
I am not sure how to start. Are there any functions that do this?
Thanks.
Use GroupBy.transform with 'mean', grouping by custom groups built from a non-zero test: reverse the Series, take the cumulative sum, and reverse back to the original order:
g = df['a'].ne(0).iloc[::-1].cumsum().iloc[::-1]
df['b'] = df.groupby(g)['a'].transform('mean')
print(df)
    a  b
0   5  5
1   0  2
2   0  2
3   6  2
4   0  3
5   0  3
6   0  3
7  12  3
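To make the grouping easier to follow, here is a minimal sketch that prints the intermediate group labels g before the transform; every zero ends up in the same group as the first non-zero value below it:

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 0, 0, 6, 0, 0, 0, 12]})

# Mark non-zero positions, reverse, cumulative sum, reverse back:
# each zero shares a label with the first non-zero value below it.
g = df['a'].ne(0).iloc[::-1].cumsum().iloc[::-1]
print(g.tolist())  # [3, 2, 2, 2, 1, 1, 1, 1]

# Replace every value with its group mean.
df['b'] = df.groupby(g)['a'].transform('mean')
print(df['b'].tolist())  # [5.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0]
```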
I hope someone could help me solve my issue.
Given a pandas dataframe as depicted in the image below,
I would like to rearrange it into a new dataframe, combining several sets of columns (the sets all have the same size) so that each set becomes a single column, as shown in the desired-result image below.
Thank you in advance for any tips.
For a general solution, you can try one of this two options:
You could try this, using OrderedDict to collect the distinct alphabetic column names in order of first appearance, pd.DataFrame.filter to select the columns with similar names, and pd.DataFrame.stack to concatenate their values:
import pandas as pd
from collections import OrderedDict
df = pd.DataFrame([[0,1,2,3,4],[5,6,7,8,9]], columns=['a1','a2','b1','b2','c'])
newdf = pd.DataFrame()
for col in list(OrderedDict.fromkeys(''.join(df.columns))):
    if col.isalpha():
        newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reset_index(drop=True)
Output:
df
   a1  a2  b1  b2  c
0   0   1   2   3  4
1   5   6   7   8  9
newdf
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
Another way to get the column names is using re and a set, then sorting the columns alphabetically:
import re

newdf = pd.DataFrame()
for col in set(re.findall(r'[^\W\d_]', ''.join(df.columns))):
    newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reindex(sorted(newdf.columns), axis=1).reset_index(drop=True)
Output:
newdf
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
You can do this with pd.wide_to_long and rename the 'c' column:
df_out = pd.wide_to_long(df.reset_index().rename(columns={'c': 'c1'}),
                         ['a', 'b', 'c'], 'index', 'no')
df_out = df_out.reset_index(drop=True).ffill().astype(int)
df_out
Output:
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
Same dataframe, just the sorting is different.
pd.wide_to_long(df, ['a','b'], 'c', 'no').reset_index().drop('no', axis=1)
Output:
   c  a  b
0  4  0  2
1  9  5  7
2  4  1  3
3  9  6  8
The fact that column c had only one column, versus two for the other letters, made it a bit tricky. I first stacked the dataframe and removed the numbers from the column names. Then, for a and b, I pivoted the dataframe and dropped all NaNs. For c, I repeated each value so its length matched a and b, and then merged it in.
input:
import pandas as pd
df = pd.DataFrame({'a1': {0: 0, 1: 5},
'a2': {0: 1, 1: 6},
'b1': {0: 2, 1: 7},
'b2': {0: 3, 1: 8},
'c': {0: 4, 1: 9}})
df
code:
import numpy as np

df1 = df.copy().stack().reset_index().replace('[0-9]+', '', regex=True)
dfab = df1[df1['level_1'].isin(['a', 'b'])].pivot(index=0, columns='level_1', values=0) \
           .apply(lambda x: pd.Series(x.dropna().values)).astype(int)
dfc = pd.DataFrame(np.repeat(df['c'].values, 2, axis=0)).rename({0: 'c'}, axis=1)
df2 = pd.merge(dfab, dfc, how='left', left_index=True, right_index=True)
df2
output:
   a  b  c
0  0  2  4
1  1  3  4
2  5  7  9
3  6  8  9
I'm wondering what is the pythonic way of achieving the following:
Given a list of list:
l = [[1, 2],[3, 4],[5, 6],[7, 8]]
I would like to create a list of pandas data frames where the first pandas data frame is a row bind of the first two elements in l and the second a row bind of the last two elements:
>>> df1 = pd.DataFrame(np.asarray(l[:2]))
>>> df1
0 1
0 1 2
1 3 4
and
>>> df2 = pd.DataFrame(np.asarray(l[2:]))
>>> df2
0 1
0 5 6
1 7 8
In my problem I have a very long list and I know the grouping, i.e. the first k elements of the list l should be row-bound to form the first df. How can this be achieved in a Python-friendly way?
You could store them in a dict, like:
In [586]: s = pd.Series(l)
In [587]: k = 2
In [588]: df = {i: pd.DataFrame(g.values.tolist()) for i, g in s.groupby(s.index // k)}
In [589]: df[0]
Out[589]:
   0  1
0  1  2
1  3  4
In [590]: df[1]
Out[590]:
   0  1
0  5  6
1  7  8
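If you prefer to skip the round-trip through a pandas Series, a plain list comprehension over slices of size k builds the same list of frames (a minimal sketch):

```python
import pandas as pd

l = [[1, 2], [3, 4], [5, 6], [7, 8]]
k = 2

# Slice the list into consecutive chunks of size k, one DataFrame per chunk.
dfs = [pd.DataFrame(l[i:i + k]) for i in range(0, len(l), k)]
print(dfs[0])
#    0  1
# 0  1  2
# 1  3  4
```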
I have a dataset given below:
a,b,c
1,1,1
1,1,1
1,1,2
2,1,2
2,1,1
2,2,1
I created crosstab with pandas:
cross_tab = pd.crosstab(index=df['a'], columns=[df['b'], df['c']], rownames=['a'], colnames=['b', 'c'])
my crosstab is given as an output:
b  1     2
c  1  2  1
a
1  2  1  0
2  1  1  1
I want to iterate over this crosstab for given each a,b and c values. How can I get values such as cross_tab[a=1][b=1, c=1]? Thank you.
You can use slicers:
a, b, c = 1, 1, 1
idx = pd.IndexSlice
print(cross_tab.loc[a, idx[b, c]])
2
You can also reshape the df with DataFrame.unstack and reorder_levels, then use loc:
a = cross_tab.unstack().reorder_levels(('a', 'b', 'c'))
print(a)
a  b  c
1  1  1    2
2  1  1    1
1  1  2    1
2  1  2    1
1  2  1    0
2  2  1    1
dtype: int64
print(a.loc[1, 1, 1])
2
You are looking for Index.get_level_values:
In [777]: cross_tab.loc[cross_tab.index.get_level_values('a') == 1,\
(cross_tab.columns.get_level_values('b') == 1)\
& (cross_tab.columns.get_level_values('c') == 1)]
Out[777]:
b  1
c  1
a
1  2
Another way to consider, albeit at the loss of a little readability, is to simply use .loc to navigate the hierarchical index generated by pandas.crosstab. The following example illustrates it:
import pandas as pd
import numpy as np
np.random.seed(1234)
df = pd.DataFrame(
    {
        "a": np.random.choice([1, 2], 5, replace=True),
        "b": np.random.choice([11, 12, 13], 5, replace=True),
        "c": np.random.choice([21, 22, 23], 5, replace=True),
    }
)
df
Output
   a   b   c
0  2  11  23
1  2  11  23
2  1  12  23
3  2  12  21
4  1  12  21
crosstab output is:
cross_tab = pd.crosstab(
    index=df.a, columns=[df.b, df.c], rownames=["a"], colnames=["b", "c"]
)
cross_tab
b  11      12
c  23  21  23
a
1   0   1   1
2   2   1   0
Now let's say you want to access the value where a == 2, b == 11 and c == 23; then simply do
cross_tab.loc[2].loc[11].loc[23]
2
Why does this work? .loc allows one to select by index labels. In the dataframe output by crosstab, our erstwhile column values now become index labels. Thus, with every .loc selection we do, it gives the slice of the dataframe corresponding to that index label. Let's navigate cross_tab.loc[2].loc[11].loc[23] step by step:
cross_tab.loc[2]
yields:
b   c
11  23    2
12  21    1
    23    0
Name: 2, dtype: int64
Next one:
cross_tab.loc[2].loc[11]
Yields:
c
23    2
Name: 2, dtype: int64
And finally we have
cross_tab.loc[2].loc[11].loc[23]
which yields:
2
Why do I say that this reduces the readability a bit? Because to understand this selection you have to be aware of how the crosstab was created, i.e. rows are a and columns were in the order [b, c]. You have to know that to be able to interpret what cross_tab.loc[2].loc[11].loc[23] would do. But I have found that often to be a good tradeoff.
Let's say I have the following DataFrame:
d = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]})
    a  b
0  10  1
1  20  2
2  30  3
I want to create a new column 'c' that will contain a tuple of 'a' and 'b' (per row). Something like this:
    a  b        c
0  10  1  (10, 1)
1  20  2  (20, 2)
2  30  3  (30, 3)
I just can't make it work, no matter what I try (I tried apply with axis=1, having it return a tuple, a list, and a Series object; none worked).
I saw that I can create a DataFrame and set the dtype to 'object' and then I can put tuples in a cell. How do I do it with apply?
What I'm trying to do is to count distinct combinations of a and b, get the most common and print a summary with some data on them (data comes from other columns, let's say 'd' and 'e').
Is there any more elegant way to do it?
You could do it using zip (in Python 3, wrap it in list, since zip returns an iterator):
>>> df = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]})
>>> df["c"] = list(zip(df["a"], df["b"]))
>>> df
    a  b        c
0  10  1  (10, 1)
1  20  2  (20, 2)
2  30  3  (30, 3)
[3 rows x 3 columns]
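Since the question specifically asks about apply, here is a minimal sketch of that route too. With the default result_type, a tuple returned from the row function is kept as a single value rather than expanded into columns:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]})

# apply over rows; each call returns one tuple, which pandas stores
# as a single cell value in the resulting Series.
df["c"] = df.apply(lambda row: (row["a"], row["b"]), axis=1)
print(df["c"].tolist())  # [(10, 1), (20, 2), (30, 3)]
```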
but usually putting a tuple in a column is the wrong way to go because pandas can't really do anything else with it at that point. If you want to count distinct combinations of a and b and do something with the associated groups of rows, you should use groupby instead:
>>> df = pd.DataFrame({'a': [10,20,30,20,30], 'b': [1,2,3,2,1]})
>>> df
    a  b
0  10  1
1  20  2
2  30  3
3  20  2
4  30  1
[5 rows x 2 columns]
>>> df_counts = df.groupby(["a", "b"]).size()
>>> df_counts = df_counts.sort_values(ascending=False)
>>> df_counts
a   b
20  2    2
30  3    1
    1    1
10  1    1
dtype: int64
"Print a summary with some data on them" is too broad to say anything useful about, but you can use groupby to perform all sorts of summary operations on the groups.
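As a minimal sketch of such a summary, assuming hypothetical extra columns d and e (the question's actual columns are not shown), you could count each combination and aggregate the extras per group:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 20, 30],
                   'b': [1, 2, 3, 2, 1],
                   'd': [0.5, 1.5, 2.5, 3.5, 4.5],  # hypothetical extra columns
                   'e': [100, 200, 300, 400, 500]})

# Count each (a, b) combination and aggregate the extra columns per group,
# then sort so the most common combinations come first.
summary = df.groupby(['a', 'b']).agg(count=('d', 'size'),
                                     d_mean=('d', 'mean'),
                                     e_sum=('e', 'sum'))
summary = summary.sort_values('count', ascending=False)
print(summary)
```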