Can anyone suggest an efficient way to convert a list of lists of dictionaries into a pandas DataFrame?
Input = [[{'name':'tom','roll_no':1234,'gender':'male'},
{'name':'sam','roll_no':1212,'gender':'male'}],
[{'name':'kavi','roll_no':1235,'gender':'female'},
{'name':'maha','roll_no':1211,'gender':'female'}]]
The dictionary keys are the same in the sample input provided, and the expected output is:
Output = name roll_no gender
0 tom 1234 male
1 sam 1212 male
2 kavi 1235 female
3 maha 1211 female
You will need to flatten your input using itertools.chain, and you can then call the pd.DataFrame constructor.
import pandas as pd
from itertools import chain
pd.DataFrame(list(chain.from_iterable(Input)))
gender name roll_no
0 male tom 1234
1 male sam 1212
2 female kavi 1235
3 female maha 1211
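An equivalent without itertools is a plain nested comprehension; a minimal sketch using the Input list from the question:
import pandas as pd
# flatten the nested list, then build the frame
pd.DataFrame([record for sublist in Input for record in sublist])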
I currently have a pandas dataframe with 100+ columns, produced by pd.json_normalize(), and there is one particular column (children) that looks something like this:
name age children address... 100 more columns
Mathew 20 [{name: Sam, age:5}, {name:Ben, age: 10}] UK
Linda 30 [] USA
What I would like for the dataframe to look like is:
name age children.name children.age address... 100 more columns
Mathew 20 Sam 5 UK
Mathew 20 Ben 10 UK
Linda 30 USA
There can be any number of dictionaries within the list. Thanks for the help in advance!
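One possible sketch (untested against your real 100+ column frame; the toy values below are hypothetical and only mirror the question's structure) is to explode the list column and then json_normalize the dictionaries:
import pandas as pd
df = pd.DataFrame({
    'name': ['Mathew', 'Linda'],
    'age': [20, 30],
    'children': [[{'name': 'Sam', 'age': 5}, {'name': 'Ben', 'age': 10}], []],
    'address': ['UK', 'USA'],
})
# one row per child; an empty list becomes a single row with NaN
exploded = df.explode('children').reset_index(drop=True)
# expand each dict into columns and prefix them as children.*
children = pd.json_normalize(
    exploded['children'].apply(lambda d: d if isinstance(d, dict) else {}).tolist()
).add_prefix('children.')
result = pd.concat([exploded.drop(columns='children'), children], axis=1)
print(result)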
I have a problem calculating percentages within a dataframe.
I have the following dataframe called dfGender:
age gender impressions
0 13-17 female 234561
1 13-17 male 34574
2 25-34 female 120665
3 25-34 male 234560
4 35-44 female 5134
5 35-44 male 2405
6 45-54 female 423
7 45-54 male 324
Now I would like to get a list of the total percentages of female and male impressions, like this: [female%, male%].
My idea is to use pivot_table with the following code:
df_genderSum = dfGender.pivot_table(columns='gender', values='impressions', aggfunc='sum')
Then I calculate the total of them all:
df_genderSum['total'] = df_genderSum.sum(axis=1)
After that I make the percentage calculations with:
df_genderSum['female%'] = (df_genderSum['female']/df_genderSum['total'])*100
df_genderSum['male%'] = (df_genderSum['male']/df_genderSum['total'])*100
This gives me the desired, correct calculations, although I think the code is really messy.
I have 2 questions:
1: Is there a simpler way to do this, where you end up with a dataframe consisting only of:
gender female% male%
impressions "number" "number"
2: How do I turn it into a list? I was thinking of the following code:
list = df_genderSum.reset_index().values.tolist()
Any help is appreciated!
You can try:
df.groupby('gender')['impressions'].apply(lambda x : (sum(x)/sum(df['impressions'])*100))
gender
female 57.0276
male 42.9724
and
df.groupby('gender')['impressions'].apply(lambda x : (sum(x)/sum(df['impressions'])*100)).to_list()
[57.02762682448004, 42.972373175519957]
If you want the exact dataframe that you asked for, save the above as "s" and do the following:
s=df.groupby('gender')['impressions'].apply(lambda x : (sum(x)/sum(df['impressions'])*100))
pd.DataFrame(s).T
gender female male
impressions 57.027627 42.972373
Here you go:
df_agg = df.drop(['age'], axis=1).groupby('gender').sum()
print(df_agg['impressions']/df_agg['impressions'].sum()*100)
Prints (values may differ depending on your data):
F 71.428571
M 28.571429
Name: impressions, dtype: float64
df_genderSum = dfGender.groupby('gender')['impressions'].sum() # same result as pivot_table
df_genderSum = df_genderSum / df_genderSum.sum() * 100 # percentage of total
# now it is a series, reshape as needed
df_genderSum = df_genderSum.to_frame().T
You can try this one:
(df.groupby('gender')['impressions'].sum() / df['impressions'].sum() * 100).to_frame(name='impressions').T
In my data frame I have a column 'countries', and I am trying to change its values to 'developed' and 'developing'. My data frame is as follows:
countries age gender
1 India 21 Male
2 China 22 Female
3 USA 23 Male
4 UK 25 Male
I have the following two lists:
developed = ['USA','UK']
developing = ['India', 'China']
I want to convert the data frame into the following:
countries age gender
1 developing 21 Male
2 developing 22 Female
3 developed 23 Male
4 developed 25 Male
I tried the following code, but I got a 'SettingWithCopyWarning':
df[df['countries'].isin(developed)]['countries'] = 'developed'
I also tried the following code, but I got the 'SettingWithCopyWarning' again and my Jupyter notebook hung:
for i, x in enumerate(df['countries']):
    if x in developed:
        df['countries'][i] = 'developed'
Is there an alternative way to change the column categories?
Use np.where:
import numpy as np
df['countries'] = np.where(df['countries'].isin(developed), 'developed', 'developing')
print(df)
countries age gender
1 developing 21 Male
2 developing 22 Female
3 developed 23 Male
4 developed 25 Male
You can also use DataFrame.loc:
c = df['countries'].isin(developed)
df.loc[c, 'countries'] = 'developed'
df.loc[~c, 'countries'] = 'developing'
print(df)
countries age gender
1 developing 21 Male
2 developing 22 Female
3 developed 23 Male
4 developed 25 Male
You can also use replace, which returns a new frame and avoids the warning:
Updated_DataSet1 = df.replace("India", "developing")
Updated_DataSet2 = Updated_DataSet1.replace("China", "developing")
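If there are more than a couple of categories, a single replace call with a mapping dict built from the two lists keeps this compact; a sketch assuming the developed/developing lists from the question:
mapping = {c: 'developed' for c in developed}
mapping.update({c: 'developing' for c in developing})
# Series.replace leaves any country not in the mapping unchanged
df['countries'] = df['countries'].replace(mapping)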
I am learning Python and I have a question about creating a data frame from every 5 rows of a file, transposing it, and merging the resulting data frames.
I have a .txt file with the following input. It has thousands of rows and I need to go through each line until the end of the file.
Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle
I need to get this as my output:
Name,Age,Sex,Company,Vehicle
Kamath,23,Male,ACC,Car
Ram,32,Male,CCA,Bike
Reena,26,Female,BARC,Cycle
Use read_csv to build the DataFrame, then pivot with cumcount as the counter for the new index:
import pandas as pd
from io import StringIO
temp=u"""Name,Kamath
Age,23
Sex,Male
Company,ACC
Vehicle,Car
Name,Ram
Age,32
Sex,Male
Company,CCA
Vehicle,Bike
Name,Reena
Age,26
Sex,Female
Company,BARC
Vehicle,Cycle"""
# after testing, replace 'StringIO(temp)' with your 'filename.txt'
df = pd.read_csv(StringIO(temp), names=['a','b'])
print (df)
a b
0 Name Kamath
1 Age 23
2 Sex Male
3 Company ACC
4 Vehicle Car
5 Name Ram
6 Age 32
7 Sex Male
8 Company CCA
9 Vehicle Bike
10 Name Reena
11 Age 26
12 Sex Female
13 Company BARC
14 Vehicle Cycle
df['idx'] = df.groupby('a').cumcount() # row counter within each block of keys
df = df.pivot(index='idx', columns='a', values='b')
df.index.name = None
print (df)
a Age Company Name Sex Vehicle
0 23 ACC Kamath Male Car
1 32 CCA Ram Male Bike
2 26 BARC Reena Female Cycle
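Since the keys repeat in fixed blocks of five, an alternative sketch is to reshape the values column of the raw two-column frame (the one returned by read_csv, before the pivot); it assumes every record is complete and the keys always appear in the same order:
raw = pd.read_csv(StringIO(temp), names=['a', 'b'])
# 15 values -> 3 rows of 5, with the first block of keys as column names
out = pd.DataFrame(raw['b'].to_numpy().reshape(-1, 5), columns=raw['a'].iloc[:5])
print(out)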
Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby() followed by an aggregate/summarize operation (in my example, my function name_join aggregates the names):
def name_join(list_names, concat='-'):
    return concat.join(list_names)
group_df = df.groupby('family')
group_df = group_df.aggregate({'name': name_join, 'age': 'mean'})
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Question:
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age column values are just examples; I don't care about the information lost by averaging in this specific example.)
The way I thought I could do it does not look too efficient:
create empty dataframe
from every line in group_df, separate the names
return a dataframe with as many rows as there are names in the starting row
append the output to the empty dataframe
The rough equivalent is .reset_index(), but it may not be helpful to think of it as the "opposite" of groupby().
You are splitting a string into pieces while maintaining each piece's association with 'family'. This old answer of mine does the job.
Just set 'family' as the index column first, refer to the link above, and then reset_index() at the end to get your desired result.
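A concrete sketch of that idea, assuming the aggregated frame group_df from the question (with 'family' as its index; explode needs pandas 0.25+): split the joined names and explode back to one row per name:
out = (group_df.assign(name=group_df['name'].str.split('-'))
               .explode('name')
               .reset_index())
If you still have the original df, df.groupby('family')['age'].transform('mean') gives the same per-row means without the round trip.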
It turns out that DataFrame.groupby() returns an object that keeps the original data in its .obj attribute, so ungrouping is just pulling out that original data.
group_df = df.groupby('family')
group_df.obj
Example
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object
import pandas
def name_join(list_names, concat='-'):
    return concat.join(list_names)
print('create dataframe\n')
df = pandas.DataFrame({'name':['john', 'jason', 'jane', 'jack', 'james'], 'age':[1,36,32,26,30], 'family':[1,1,1,2,2]})
df.index.name='indexer'
print(df)
print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)
print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
print(group_joined_df)
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
There are a few ways to undo DataFrame.groupby; one is DataFrame.groupby(...).filter(lambda x: True), which gets you back to the original DataFrame.
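A quick sketch, reusing the df from the example above:
grouped = df.groupby('family')
restored = grouped.filter(lambda x: True)  # every group passes the filter, so all original rows come back
# note: rows whose group key is NaN would still be dropped, since groupby excludes them by default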