Pandas multiple dataframes into one - python

I'm looping through a list with multiple dictionaries and want them appended into a single DataFrame.
# getting values of a specific key from AWS' boto3 response
events_list = response_event.get('Events')
for e in events_list:
    df = pd.DataFrame.from_dict(e)
    print(df)
Current result:
   col1  col2
0     1     3
   col1  col2
0     2     4
   col1  col2
0     3     5
Expected result:
   col1  col2
0     1     3
1     2     4
2     3     5

Try with concat, passing ignore_index=True so the result gets a fresh 0..n-1 index:
out = pd.concat((pd.DataFrame.from_dict(e) for e in events_list), ignore_index=True)
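A minimal, self-contained sketch of the concat approach; the `events_list` here is a hypothetical stand-in for the boto3 Events payload:

```python
import pandas as pd

# Hypothetical stand-in for the boto3 Events payload: a list of dicts,
# each of which from_dict turns into a one-row frame.
events_list = [
    {'col1': [1], 'col2': [3]},
    {'col1': [2], 'col2': [4]},
    {'col1': [3], 'col2': [5]},
]

# Build one frame per event, then concatenate into a single frame.
# ignore_index=True rebuilds the index as 0..n-1 instead of repeating 0.
out = pd.concat(
    (pd.DataFrame.from_dict(e) for e in events_list),
    ignore_index=True,
)
print(out)
```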

Fetch row data from known index in pandas

df1:
  col1  col2
0    a     5
1    b     2
2    c     1
df2:
  col1
0  qa0
1  qa1
2  qa2
3  qa3
4  qa4
5  qa5
final output:
  col1  col2 col3
0    a     5  qa5
1    b     2  qa2
2    c     1  qa1
Basically, in df1, col2 stores the index of a row in df2. I have to fetch that data from df2 and append it to df1, but I don't know how to fetch data by index number.
Use Series.map with another Series:
df1['col3'] = df1['col2'].map(df2['col1'])
Or use DataFrame.join after renaming the column:
df1 = df1.join(df2.rename(columns={'col1':'col3'})['col3'], on='col2')
print (df1)
  col1  col2 col3
0    a     5  qa5
1    b     2  qa2
2    c     1  qa1
You can use iloc to get the data, and then to_numpy for the values:
df1["col3"] = df2.iloc[df1.col2].to_numpy()
df1
  col1  col2 col3
0    a     5  qa5
1    b     2  qa2
2    c     1  qa1
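A runnable sketch of the Series.map approach, with the sample frames rebuilt from the question:

```python
import pandas as pd

# Sample frames matching the question.
df1 = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [5, 2, 1]})
df2 = pd.DataFrame({'col1': ['qa0', 'qa1', 'qa2', 'qa3', 'qa4', 'qa5']})

# col2 holds index labels into df2, so map does an index lookup of
# each col2 value against the df2['col1'] Series.
df1['col3'] = df1['col2'].map(df2['col1'])
print(df1)
```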

Select Multiple Columns in DataFrame Pandas. Slice + Select

I have a DataFrame with almost 100 columns.
I need to select col2 to col4 and col54. How can I do it?
I tried:
df = df.loc[:,'col2':'col4']
but I can't add col54.
You can do this in a couple of different ways:
Using the same format you are currently trying to use, I think joining col54 will be necessary:
df = df.loc[:,'col2':'col4'].join(df.loc[:,'col54'])
Another method, given that col2 is close to col4, would be to do this:
df = df.loc[:,['col2','col3','col4', 'col54']]
or simply
df = df[['col2','col3','col4','col54']]
You can simply do this:
df = df.loc[:,['col2','col4','col54']]
loc takes the column names as list as well.
Or this:
df[['col2','col4','col54']]
You can use a list or a pandas.IndexSlice object:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(1, index=[0,1,2], columns=["col1","col2","col3","col4","col5"])
In [3]: df
Out[3]:
   col1  col2  col3  col4  col5
0     1     1     1     1     1
1     1     1     1     1     1
2     1     1     1     1     1
In [4]: df.loc[:,['col1','col2','col4','col5']]
Out[4]:
   col1  col2  col4  col5
0     1     1     1     1
1     1     1     1     1
2     1     1     1     1
In [5]: slicer = pd.IndexSlice
In [6]: df.loc[:,slicer["col3":"col5"]]
Out[6]:
   col3  col4  col5
0     1     1     1
1     1     1     1
2     1     1     1
edit: I see I misread the OP. This is a bit tough. You can get 'Col2','Col3','Col4' using pandas.IndexSlice as demonstrated above; I'm trying to figure out how to include 'Col54' in that.
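One way to combine a contiguous slice with an extra column is to build the column list from the slice and append the extra name; a sketch, with a toy frame standing in for the ~100-column one:

```python
import pandas as pd

# Toy frame standing in for the OP's ~100-column DataFrame.
df = pd.DataFrame(1, index=[0, 1, 2],
                  columns=[f'col{i}' for i in range(1, 60)])

# Take the contiguous label slice col2:col4, then append col54 by name.
cols = list(df.loc[:, 'col2':'col4'].columns) + ['col54']
subset = df[cols]
print(subset.columns.tolist())
```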

How to split a pandas dataframe into multiple parts based on consecutively occurring values in a column?

I have a dataframe, represented in tabular format below. The original dataframe is a lot bigger, so I cannot afford to loop over each row.
col1 | col2 | col3
  a  |   x  |   1
  b  |   y  |   1
  c  |   z  |   0
  d  |   k  |   1
  e  |   l  |   1
What I want is to split it into subsets of dataframes based on consecutive runs of 1s in the column col3.
So ideally the above dataframe should return two dataframes, df1 and df2:
df1
col1 | col2 | col3
  a  |   x  |   1
  b  |   y  |   1
df2
col1 | col2 | col3
  d  |   k  |   1
  e  |   l  |   1
Is there an approach like groupby to do this?
If I use groupby, it returns all 4 rows with col3 == 1 in one dataframe.
I do not want that, as I need two dataframes, each consisting of consecutively occurring 1s.
One method is obviously to loop over the rows and return a dataframe whenever I find a 0, but that is not efficient. Any kind of help is appreciated.
First compare the values to 1, then create consecutive groups with shift and cumulative sum, and last collect all groups in a list comprehension over groupby:
m1 = df['col3'].eq(1)
g = m1.ne(m1.shift()).cumsum()
dfs = [x for i, x in df[m1].groupby(g)]
print (dfs)
[  col1 col2  col3
0    a    x     1
1    b    y     1,   col1 col2  col3
3    d    k     1
4    e    l     1]
print (dfs[0])
  col1 col2  col3
0    a    x     1
1    b    y     1
If it is also necessary to remove single 1 rows, add Series.duplicated with keep=False:
print (df)
  col1 col2  col3
0    a    x     1
1    b    y     1
2    c    z     0
3    d    k     1
4    e    l     1
5    f    m     0
6    g    n     1  <- removed
m1 = df['col3'].eq(1)
g = m1.ne(m1.shift()).cumsum()
g = g[g.duplicated(keep=False)]
print (g)
0    1
1    1
3    3
4    3
Name: col3, dtype: int32
dfs = [x for i, x in df[m1].groupby(g)]
print (dfs)
[  col1 col2  col3
0    a    x     1
1    b    y     1,   col1 col2  col3
3    d    k     1
4    e    l     1]
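The full pipeline above, made runnable with the answer's sample data (including the lone 1 in the last row, which gets dropped):

```python
import pandas as pd

# Frame from the answer, including a lone 1 in the last row.
df = pd.DataFrame({'col1': list('abcdefg'),
                   'col2': list('xyzklmn'),
                   'col3': [1, 1, 0, 1, 1, 0, 1]})

# Mask the 1s, label each consecutive run via shift + cumsum,
# drop runs of length 1, and split the frame into one piece per run.
m1 = df['col3'].eq(1)
g = m1.ne(m1.shift()).cumsum()
g = g[g.duplicated(keep=False)]
dfs = [x for i, x in df[m1].groupby(g)]
print(dfs)
```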

how to iterate over pandas dataframe columns

I need to do some operations with my dataframe.
My dataframe is:
df = pd.DataFrame(data={'col1':[1,2],'col2':[3,4]})
   col1  col2
0     1     3
1     2     4
My operation is column dependent.
For example, I need to add (+) the .max() of a column to each value in that column.
So df.col1.max() is 2 and df.col2.max() is 4, and my output should be:
   col1  col2
0     3     7
1     4     8
I have been trying this:
for i in df.columns:
    df.i += df.i.max()
but:
AttributeError: 'DataFrame' object has no attribute 'i'
You can chain df.add and df.max and specify the axis, which avoids any loops:
df1 = df.add(df.max(axis=0))
print(df1)
   col1  col2
0     3     7
1     4     8
To loop through the columns and add the maximum of each column, you can do the following:
for col in df:
    df[col] += df[col].max()
This gives:
   col1  col2
0     3     7
1     4     8
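As a side note, plain `+` also broadcasts a per-column Series across the frame, so the same result can be sketched without df.add or a loop:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# df.max() returns a Series of per-column maxima; adding it to the
# frame broadcasts along the columns, so no explicit loop is needed.
df1 = df + df.max()
print(df1)
```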

How to do group by and take count of unique and count of some value as aggregate on same column in python pandas?

My question is related to my previous question, but it's different, so I am asking a new one.
In the question above, see the answer by @jezrael.
df = pd.DataFrame({'col1':[1,1,1],
                   'col2':[4,4,6],
                   'col3':[7,7,9],
                   'col4':[3,3,5]})
print (df)
   col1  col2  col3  col4
0     1     4     7     3
1     1     4     7     3
2     1     6     9     5
df1 = df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'})
df1['result_col'] = df1['col3'].div(df1['col4'])
print (df1)
           col4  col3  result_col
col1 col2
1    4        1     2         2.0
     6        1     1         1.0
Now I also want the count of a specific value of col4, say col4 == 3, in the same query:
df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'}) ... + count(col4=='3')
How can I do this in the same query above? I have tried the below but am not getting a solution:
df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique','col4':'x: lambda x[x == 7].count()'})
Do some preprocessing by including col4 == 3 as a column ahead of time, then aggregate:
df.assign(result_col=df.col4.eq(3).astype(int)).groupby(
    ['col1', 'col2']
).agg(dict(col3='size', col4='nunique', result_col='sum'))
           col3  result_col  col4
col1 col2
1    4        2           2     1
     6        1           0     1
old answers
g = df.groupby(['col1', 'col2'])
g.agg({'col3':'size','col4': 'nunique'}).assign(
    result_col=g.col4.apply(lambda x: x.eq(3).sum()))
           col3  col4  result_col
col1 col2
1    4        2     1           2
     6        1     1           0
slightly rearranged
g = df.groupby(['col1', 'col2'])
final_df = g.agg({'col3':'size','col4': 'nunique'})
final_df.insert(1, 'result_col', g.col4.apply(lambda x: x.eq(3).sum()))
final_df
           col3  result_col  col4
col1 col2
1    4        2           2     1
     6        1           0     1
I think you need to aggregate with a list of functions in the dict for column col4.
If you need to count the 3 values, the simplest is to sum the True values in x == 3:
df1 = (df.groupby(['col1','col2'])
         .agg({'col3':'size','col4': ['nunique', lambda x: (x == 3).sum()]}))
df1 = df1.rename(columns={'<lambda>':'count_3'})
df1.columns = ['{}_{}'.format(x[0], x[1]) for x in df1.columns]
print (df1)
           col4_nunique  col4_count_3  col3_size
col1 col2
1    4                1             2          2
     6                1             0          1
