flatten dataframe containing list with multiple dictionaries - python

I currently have a pandas DataFrame with 100+ columns, obtained from pd.json_normalize(), and there is one particular column (children) that looks something like this:
name age children address... 100 more columns
Mathew 20 [{name: Sam, age:5}, {name:Ben, age: 10}] UK
Linda 30 [] USA
What I would like for the dataframe to look like is:
name age children.name children.age address... 100 more columns
Mathew 20 Sam 5 UK
Mathew 20 Ben 10 UK
Linda 30 USA
There can be any number of dictionaries within the list. Thanks for the help in advance!
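One possible approach, sketched under the assumption that each cell of children holds a Python list of dicts (not a JSON string) and that pandas 0.25 or later is available for DataFrame.explode. The sample frame below is a cut-down, hypothetical stand-in for the real 100+ column data:

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame.
df = pd.DataFrame({
    "name": ["Mathew", "Linda"],
    "age": [20, 30],
    "children": [[{"name": "Sam", "age": 5}, {"name": "Ben", "age": 10}], []],
    "address": ["UK", "USA"],
})

# One row per list element; an empty list becomes a single row holding NaN.
exploded = df.explode("children").reset_index(drop=True)

# Replace the NaN placeholders with empty dicts so every row yields a record,
# then expand the dicts into columns prefixed with "children.".
records = [c if isinstance(c, dict) else {} for c in exploded["children"]]
children = pd.DataFrame(records).add_prefix("children.")

# Put the expanded columns next to the remaining original columns.
flat = pd.concat([exploded.drop(columns="children"), children], axis=1)
print(flat)
```

Rows that had an empty list keep NaN in the children.* columns, which also means children.age comes out as float rather than int.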

Related

Add column to DataFrame and assign number to each row

I have the following table
Father Son Year
James Harry 1999
James Alfi 2001
Corey Kyle 2003
I would like to add a fourth column that makes the table look like below. It's supposed to show which child of each father was born first, second, third, and so on. How can I do that?
Father Son Year Child
James Harry 1999 1
James Alfi 2001 2
Corey Kyle 2003 1
Here is one way to do it, using cumcount:
# group by Father and take a cumcount, offset by 1
df['Child']=df.groupby(['Father'])['Son'].cumcount()+1
df
Father Son Year Child
0 James Harry 1999 1
1 James Alfi 2001 2
2 Corey Kyle 2003 1
This assumes that df is sorted by Father and Year. If it is not, sort first:
df['Child']=df.sort_values(['Father','Year']).groupby(['Father'] )['Son'].cumcount()+1
df
Here is an idea of solving this using groupby and cumsum functions.
This assumes that the rows are ordered so that the younger sibling is always below their elder brother and all children of the same father are in a continuous pack of rows.
Assume we have the following setup
import pandas as pd
df = pd.DataFrame({'Father': ['James', 'James', 'Corey'],
'Son': ['Harry', 'Alfi', 'Kyle'],
'Year': [1999, 2001, 2003]})
Then here is the trick: we group the siblings with the same father using groupby, and then compute the cumulative sum of a column of ones to assign a sequential number to each row.
df['temp_column'] = 1
df['Child'] = df.groupby('Father')['temp_column'].cumsum()
# drop returns a new DataFrame, so assign the result back
df = df.drop(columns='temp_column')
The result would look like this
Father Son Year Child
0 James Harry 1999 1
1 James Alfi 2001 2
2 Corey Kyle 2003 1
Now to make the solution more general consider reordering the rows to satisfy the preconditions before applying the solution and then if necessary restore the dataframe to the original order.
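A minimal sketch of that generalisation, assuming birth order is determined by the Year column. The rows below are deliberately shuffled (an assumption made for illustration), and cumcount is swapped in for the temporary column of ones:

```python
import pandas as pd

# Same people as above, but in an arbitrary row order.
df = pd.DataFrame({'Father': ['James', 'Corey', 'James'],
                   'Son': ['Alfi', 'Kyle', 'Harry'],
                   'Year': [2001, 2003, 1999]})

# Sort so that siblings are contiguous and ordered by birth year.
ordered = df.sort_values(['Father', 'Year'])

# Number the rows within each father, starting at 1.
ordered['Child'] = ordered.groupby('Father').cumcount() + 1

# Restore the original row order using the preserved index.
result = ordered.sort_index()
print(result)
```

Because sort_values keeps the original index labels, sort_index is enough to get back to the starting order.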

How to convert a string size 1 into Dataframe?

I'm just getting back into coding and came across this issue.
How do I turn a single string into a DataFrame, where every five lines of the string become one row?
The string looks like this:
"Jane Doe
Male-52
City- NYC
$36,000
total salary
Amy sam
Female-65
City- NYC
$38,000
total salary
.....
.....
and so on
"
How do I turn it into a DataFrame like this?
Name Sex age City Total Salary
Jane Doe Male 52 NYC 36,000
Amy Sam Female 65 NYC 38,000
......
My code is
elements = driver.find_elements_by_xpath("""//*[@id="file"]""")
data = "".join([element.text for element in elements])
import pandas
s = """Jane Doe
Male-52
City- NYC
$36,000
total salary
Amy sam
Female-65
City- NYC
$38,000
total salary"""
import re
df = pandas.DataFrame(re.findall(r"(\w+ \w+)\n(\w+)-(\d+)\nCity- (\w+)\n\$(.*)", s),
columns=["name","sex","age","city","salary"])
print(df)
is one way to solve this ...
This should work for any number of columns; you just have to pass appropriate column names to the DataFrame afterwards. You will also have to clean up the columns and delete the unnecessary ones after the reshaping is done.
Edited to include the entire code and output
import pandas as pd
mystr = """Jane Doe
Male-52
City- NYC
$36,000
total salary
Amy sam
Female-65
City- NYC
$38,000
total salary"""
num_columns = 5
df = pd.Series(mystr.split("\n"), name="data")
pd.DataFrame(df.values.reshape((int(df.shape[0]/num_columns), num_columns)))
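The clean-up step mentioned above might look roughly like this. It is only a sketch: it assumes every record is exactly five lines in the same order, and the intermediate column names (sex_age, label) are made up for illustration:

```python
import pandas as pd

mystr = """Jane Doe
Male-52
City- NYC
$36,000
total salary
Amy sam
Female-65
City- NYC
$38,000
total salary"""

num_columns = 5
lines = mystr.split("\n")

# Chunk the lines into records of five and name the raw columns.
rows = [lines[i:i + num_columns] for i in range(0, len(lines), num_columns)]
raw = pd.DataFrame(rows, columns=["Name", "sex_age", "City", "Total Salary", "label"])

# Split "Male-52" into separate Sex and Age columns.
sex_age = raw["sex_age"].str.split("-", expand=True)
raw["Sex"] = sex_age[0]
raw["Age"] = sex_age[1].astype(int)

# Strip the literal prefixes and drop the constant "total salary" label column.
raw["City"] = raw["City"].str.replace("City- ", "", regex=False)
raw["Total Salary"] = raw["Total Salary"].str.lstrip("$")
clean = raw[["Name", "Sex", "Age", "City", "Total Salary"]]
print(clean)
```

The salary is left as a string such as "36,000"; converting it to a number would need the comma removed first.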

Turn repeat entries associated with different values into one entry with list of those values? [duplicate]

This question already has answers here:
How to group dataframe rows into list in pandas groupby
(17 answers)
Closed 3 years ago.
I wasn't sure how to title this.
Assume the following Pandas DataFrame:
Student ID Class
1 John 99124 Biology
2 John 99124 History
3 John 99124 Geometry
4 Sarah 74323 Physics
5 Sarah 74323 Geography
6 Sarah 74323 Algebra
7 Alex 80045 Trigonometry
8 Alex 80045 Economics
9 Alex 80045 French
I'd like to reduce the number of rows in this DataFrame by creating a list of the classes that each student is taking, and then putting that list in the "Class" column. Here's my desired output:
Student ID Class
1 John 99124 ["Biology","History","Geometry"]
2 Sarah 74323 ["Physics","Geography","Algebra"]
3 Alex 80045 ["Trigonometry","Economics","French"]
I am working with a large DataFrame that is not as nicely organized as this example. Any help is appreciated.
You need to groupby on Student and ID and then use agg.
df.groupby(['Student', 'ID'], as_index=False).agg({'Class': list})
Output:
Student ID Class
0 Alex 80045 [Trigonometry, Economics, French]
1 John 99124 [Biology, History, Geometry]
2 Sarah 74323 [Physics, Geography, Algebra]
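For reference, a self-contained version of the same idea. The sort=False argument is an addition not present in the answer above; it is used here on the assumption that the groups should stay in order of first appearance (John, Sarah, Alex), as in the desired output, rather than being sorted by key:

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["John"] * 3 + ["Sarah"] * 3 + ["Alex"] * 3,
    "ID": [99124] * 3 + [74323] * 3 + [80045] * 3,
    "Class": ["Biology", "History", "Geometry",
              "Physics", "Geography", "Algebra",
              "Trigonometry", "Economics", "French"],
})

# Collect each student's classes into a list; sort=False keeps the groups
# in the order they first appear instead of sorting them alphabetically.
grouped = df.groupby(["Student", "ID"], as_index=False, sort=False).agg({"Class": list})
print(grouped)
```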
df.groupby('ID')['Class'].apply(list)
Let's see. Using some help from
Apply multiple functions to multiple groupby columns
you could write something like this (note that the column names are case-sensitive):
df = df.groupby('Student').agg({'ID': 'max', 'Class': lambda x: x.tolist()})
hope it helps, giulio
Try the following, which joins the quoted class names into a single comma-separated string rather than a list:
df.groupby(['Student', 'ID'],as_index=False).agg(lambda x:','.join('"'+x+'"'))

Convert nested list of dictionaries to pandas DataFrame

Can anyone suggest an efficient way to convert a list of lists of dictionaries into a pandas DataFrame?
Input = [[{'name':'tom','roll_no':1234,'gender':'male'},
{'name':'sam','roll_no':1212,'gender':'male'}],
[{'name':'kavi','roll_no':1235,'gender':'female'},
{'name':'maha','roll_no':1211,'gender':'female'}]]
The dictionary keys are same in the sample input provided and an expected output is,
Output = name roll_no gender
0 tom 1234 male
1 sam 1212 male
2 kavi 1235 female
3 maha 1211 female
You will need to flatten your input using itertools.chain, and you can then call the pd.DataFrame constructor.
from itertools import chain
import pandas as pd
# Input is the nested list from the question
pd.DataFrame(list(chain.from_iterable(Input)))
gender name roll_no
0 male tom 1234
1 male sam 1212
2 female kavi 1235
3 female maha 1211

Iterating through two pandas dataframes and appending data from one dataframe to the other

I have two pandas data-frames that look like this:
data_frame_1:
index un_id city
1 abc new york
2 def atlanta
3 gei toronto
4 lmn tampa
data_frame_2:
index name un_id
1 frank gei
2 john lmn
3 lisa abc
4 jessica def
I need to match names to cities via the un_id column either in a new data-frame or an existing data-frame. I am having trouble figuring out how to iterate through one column, grab the un_id, iterate through the other un_id column in the other data-frame with that un_id, and then append the information needed back to the original data-frame.
Use pandas merge (here df1 and df2 stand for data_frame_1 and data_frame_2):
In[14]:df2.merge(df1,on='un_id')
Out[14]:
name un_id city
0 frank gei toronto
1 john lmn tampa
2 lisa abc new york
3 jessica def atlanta
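A self-contained sketch of the same merge, rebuilding the two frames from the question. The how="left" argument is an assumption added here so that every name in data_frame_2 is kept even when its un_id has no match in data_frame_1; the default is an inner join, which would drop such rows:

```python
import pandas as pd

data_frame_1 = pd.DataFrame({
    "un_id": ["abc", "def", "gei", "lmn"],
    "city": ["new york", "atlanta", "toronto", "tampa"],
})
data_frame_2 = pd.DataFrame({
    "name": ["frank", "john", "lisa", "jessica"],
    "un_id": ["gei", "lmn", "abc", "def"],
})

# Look up each person's city through the shared un_id column;
# no explicit iteration over either frame is needed.
merged = data_frame_2.merge(data_frame_1, on="un_id", how="left")
print(merged)
```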
