I have data from a json file. I have been able to normalize it to an extent.
I'm wondering if there's an elegant way to convert the remaining data to columns.
I tried to use pd.json_normalize but I get an error on the list in column D.
My next attempt was to separate out D, create a DataFrame from the lists in D, normalize each dict individually, and then concat the DataFrames together.
My current issue is that the value of the first key in each dict should be a column name, with the value coming from the key called 'value'. There are five column names across the dicts in column D.
{'key': 'column name', 'value' : 'data value'}
A further complication is that most rows have three dicts, but some have one or two.
I think I could brute-force it by swapping the keys and values for the first key in each dict, and then using json_normalize to create columns and values from each dict. But I'm wondering if there's a more elegant way of handling this JSON data?
I'm trying to turn this:
    A        B        C        D  E
0   Value A  Value B  Value c  [{'key': 'column name 1', 'value': 'data value'}, {'key': 'column name 2', 'value': 'data value'}, {'key': 'column name 3', 'value': 'data value'}]
1   Value A  Value B  Value c  [{'key': 'column name 1', 'value': 'data value'}, {'key': 'column name 2', 'value': 'data value'}, {'key': 'column name 3', 'value': 'data value'}]
2   Value A  Value B  Value c  [{'key': 'column name 1', 'value': 'data value'}, {'key': 'column name 2', 'value': 'data value'}, {'key': 'column name 3', 'value': 'data value'}]
3   Value A  Value B  Value c  [{'key': 'column name 1', 'value': 'data value'}, {'key': 'column name 2', 'value': 'data value'}, {'key': 'column name 3', 'value': 'data value'}]
4   Value A  Value B  Value c  [{'key': 'column name 4', 'value': 'data value'}]
5   Value A  Value B  Value c  [{'key': 'column name 5', 'value': 'data value'}, {'key': 'column name 4', 'value': 'data value'}, {'key': 'column name 2', 'value': 'data value'}]
6   Value A  Value B  Value c  [{'key': 'column name 5', 'value': 'data value'}, {'key': 'column name 4', 'value': 'data value'}, {'key': 'column name 2', 'value': 'data value'}]
7   Value A  Value B  Value c  [{'key': 'column name 5', 'value': 'data value'}, {'key': 'column name 4', 'value': 'data value'}, {'key': 'column name 2', 'value': 'data value'}]
8   Value A  Value B  Value c  [{'key': 'column name 5', 'value': 'data value'}, {'key': 'column name 4', 'value': 'data value'}, {'key': 'column name 2', 'value': 'data value'}]
9   Value A  Value B  Value c  [{'key': 'column name 5', 'value': 'data value'}, {'key': 'column name 4', 'value': 'data value'}, {'key': 'column name 2', 'value': 'data value'}]
10  Value A  Value B  Value c  [{'key': 'column name 5', 'value': 'data value'}, {'key': 'column name 4', 'value': 'data value'}, {'key': 'column name 2', 'value': 'data value'}]
Into:
    A        B        C        D  column name 1  column name 2  column name 3  column name 4  column name 5
0   Value A  Value B  Value c     data value     data value     data value
1   Value A  Value B  Value c     data value     data value     data value
2   Value A  Value B  Value c     data value     data value     data value
3   Value A  Value B  Value c     data value     data value     data value
4   Value A  Value B  Value c                                                  data value
5   Value A  Value B  Value c                                                  data value     data value
6   Value A  Value B  Value c                                                  data value     data value
7   Value A  Value B  Value c                                                  data value     data value
8   Value A  Value B  Value c                                                  data value     data value
9   Value A  Value B  Value c                                                  data value     data value
10  Value A  Value B  Value c                                                  data value     data value
Code:
import pandas as pd

df = pd.read_json(file)
# split the lists in D into their own frame (up to three dicts per row)
df2 = pd.DataFrame(df['D'].to_list(), columns=['list_a', 'list_b', 'list_c'])
for column in df2:
    # normalize each dict column and append the result to the original frame
    df3 = pd.json_normalize(df2[column])
    df = pd.concat([df, df3], axis=1)
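For what it's worth, one fairly compact way to do this is to build one key-to-value dict per row and let the DataFrame constructor align the columns. A minimal sketch, assuming df is the frame read above and every entry in D is a list of {'key': ..., 'value': ...} dicts (non-list entries fall back to an empty dict):

import pandas as pd

# one dict per row: map each inner dict's 'key' to its 'value';
# the constructor aligns the resulting columns and fills gaps with NaN
wide = pd.DataFrame(
    [{d['key']: d['value'] for d in row} if isinstance(row, list) else {}
     for row in df['D']],
    index=df.index,
)
df = pd.concat([df.drop(columns='D'), wide], axis=1)

Rows with only one or two dicts simply end up with NaN in the columns they don't mention.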
Related
Suppose I have a Python Dataframe:
Column A  Column B
A         Val 1
A         Val 2
B         Val A
B         Val B
B         Val C
B         Val D
I want to collect Column B into a dictionary keyed by the unique values of Column A, like this:
out = { 'A': ['Val 1','Val 2'],
'B': ['Val A','Val B','Val C','Val D'] }
How would I do that?
I tried making a pivot table, but it only allows aggregating Column B; I want the values kept as separate items in a list.
One way using pandas.DataFrame.groupby:
out = df.groupby("Column A")["Column B"].apply(list).to_dict()
Output:
{'A': ['Val 1', 'Val 2'], 'B': ['Val A', 'Val B', 'Val C', 'Val D']}
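As a self-contained sketch, the sample frame above can be rebuilt by hand and run through that one-liner:

import pandas as pd

df = pd.DataFrame({"Column A": ["A", "A", "B", "B", "B", "B"],
                   "Column B": ["Val 1", "Val 2", "Val A", "Val B", "Val C", "Val D"]})

# group by Column A and collect each group's Column B values into a list
out = df.groupby("Column A")["Column B"].apply(list).to_dict()
print(out)  # {'A': ['Val 1', 'Val 2'], 'B': ['Val A', 'Val B', 'Val C', 'Val D']}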
I have a sample dataframe as follows:
Main Key  Second  Column A  Column B  Column C  Column D  Column E
First     A       Value 1   Value 2   Value 3   Value 4   Value 5
Second    B       Value 6   Value 7   Value 8   Value 9   Value 10
Third     C       Value 11  Value 12  Value 13  Value 14  Value 15
Fourth    D       Value 16  Value 17  Value 18  Value 19  Value 20
I want to make a new column called 'Aggregated Data', where each value in Columns A to E becomes a key-value pair, and the pairs are combined into 'Aggregated Data' in JSON format.
The expected output would look like this:
Main Key  Second  Aggregated Data
First     A       {"Column A":"Value 1","Column B":"Value 2","Column C":"Value 3","Column D":"Value 4","Column E":"Value 5"}
Second    B       {"Column A":"Value 6","Column B":"Value 7","Column C":"Value 8","Column D":"Value 9","Column E":"Value 10"}
Third     C       {"Column A":"Value 11","Column B":"Value 12","Column C":"Value 13","Column D":"Value 14","Column E":"Value 15"}
Fourth    D       {"Column A":"Value 16","Column B":"Value 17","Column C":"Value 18","Column D":"Value 19","Column E":"Value 20"}
Any idea how this can be achieved? Thanks
Via an intermediate pandas.DataFrame.to_dict call (with orient='records' to obtain a list like [{column -> value}, ..., {column -> value}]):
df[['Main Key', 'Second']].assign(
    Aggregated_Data=df.set_index(['Main Key', 'Second']).to_dict(orient='records'))
Main Key Second Aggregated_Data
0 First A {'Column A': 'Value 1 ', 'Column B': 'Value 2 ...
1 Second B {'Column A': 'Value 6 ', 'Column B': 'Value 7 ...
2 Third C {'Column A': 'Value 11 ', 'Column B': 'Value 1...
3 Fourth D {'Column A': 'Value 16 ', 'Column B': 'Value 1...
Just skip the first two columns and call to_json:
out = (df[["Main Key", "Second"]]
       .assign(Aggregated_Data=df.iloc[:, 2:]
               .apply(lambda x: x.to_json(), axis=1)))
Alternatively, use a dict comprehension inside a list comprehension:
df["Aggregated_Data"] = [dict(zip(df.columns[2:], row))
                         for row in df.iloc[:, 2:].to_numpy()]
Output:
print(out)
Main Key Second Aggregated_Data
0 First A {"Column A":"Value 1","Column B":"Value 2","Co...
1 Second B {"Column A":"Value 6","Column B":"Value 7","Co...
2 Third C {"Column A":"Value 11","Column B":"Value 12","...
3 Fourth D {"Column A":"Value 16","Column B":"Value 17","...
I have some data with duplicates that looks like this:
WEBPAGE    ID    VALUE
Webpage 1  ID 1  Value 1
Webpage 1  ID 1  Value 2
Webpage 1  ID 1  Value 3
Webpage 1  ID 2  Value 4
Webpage 1  ID 2  Value 5
Each webpage can have more than 1 ID associated with it and each ID can have more than one value associated with it.
I'd like to ideally have a nested dictionary with lists to handle the multiple IDs and multiple values:
{WEBPAGE: {ID 1: [value 1, value 2, value 3], ID 2: [value 4, value 5]}}
I've tried using to_dict and groupby, but I can't seem to find the right syntax to create a nested dictionary with those.
Try:
out = {}
for _, x in df.iterrows():
    out.setdefault(x["WEBPAGE"], {}).setdefault(x["ID"], []).append(x["VALUE"])
print(out)
Prints:
{
"Webpage 1": {
"ID 1": ["Value 1", "Value 2", "Value 3"],
"ID 2": ["Value 4", "Value 5"],
}
}
For a pandas approach, you just need to use a nested groupby:
d = (df.groupby('WEBPAGE')
       .apply(lambda g: g.groupby('ID')['VALUE'].agg(list).to_dict())
       .to_dict()
)
output:
{'Webpage 1': {'ID 1': ['Value 1', 'Value 2', 'Value 3'],
'ID 2': ['Value 4', 'Value 5']}}
Another possible solution, using dictionary comprehension:
{x: {y: [z for z in df.VALUE[(df.WEBPAGE == x) & (df.ID == y)]]
     for y in df.ID[df.WEBPAGE == x]} for x in df.WEBPAGE}
Output:
{'Webpage 1': {'ID 1': ['Value 1', 'Value 2', 'Value 3'],
'ID 2': ['Value 4', 'Value 5']}}
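For reference, a small sketch of the sample frame assumed by the snippets above, rebuilt from the table:

import pandas as pd

df = pd.DataFrame({
    "WEBPAGE": ["Webpage 1"] * 5,
    "ID":      ["ID 1", "ID 1", "ID 1", "ID 2", "ID 2"],
    "VALUE":   ["Value 1", "Value 2", "Value 3", "Value 4", "Value 5"],
})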
I want to create a Dataframe that contains columns that have 'X' above them in the header, but I'm unable to find a way to get column index numbers.
I would select the right dataframe once I have the column index numbers.
df_right_columns = df_right_column[df_right_column.columns[column_numbers]]
Sample df:
# duplicate 'X' keys collapse in a dict literal, so build the frame from rows
# and pass the repeated labels through `columns`
df = pd.DataFrame([['column1', 'column2', 'column3', 'column4'],
                   ['cell 1', 'cell 2', 'cell 3', 'cell 4'],
                   ['cell 2', 'cell 3', 'cell 4', 'cell 5'],
                   ['cell 3', 'cell 4', 'cell 5', 'cell 6'],
                   ['cell 4', 'cell 6', 'cell 7', 'cell 8']],
                  columns=['X', 'X', '', 'X'])
X        X                 X
column1  column2  column3  column4
cell 1   cell 2   cell 3   cell 4
cell 2   cell 3   cell 4   cell 5
cell 3   cell 4   cell 5   cell 6
cell 4   cell 5   cell 6   cell 7
cell 5   cell 6   cell 7   cell 8
I have tried running this dataframe through a for loop to try to get the index numbers, but haven't had any luck so far. I did this by 1) locating the X header row, and 2) running this row through a for loop to check which columns contain 'X' in the df.iloc[0] row.
df = df.iloc[0]
for cell in df:
    if 'X' in cell:
        print(cell.index)  # this returns <built-in method index of str object at 0x7f23be18a9b0>, not a column position
        print(cell)        # this returns the cell value ('X' in this case), not the index
I'm very close and any help would be greatly appreciated, many thanks
Solution
# build the frame from rows so the duplicate 'X' column labels survive
df = pd.DataFrame([['column1', 'column2', 'column3', 'column4'],
                   ['cell 1', 'cell 2', 'cell 3', 'cell 4'],
                   ['cell 2', 'cell 3', 'cell 4', 'cell 5'],
                   ['cell 3', 'cell 4', 'cell 5', 'cell 6'],
                   ['cell 4', 'cell 6', 'cell 7', 'cell 8']],
                  columns=['X', 'X', '', 'X'])
print(df)
X        X                 X
column1  column2  column3  column4
cell 1   cell 2   cell 3   cell 4
cell 2   cell 3   cell 4   cell 5
cell 3   cell 4   cell 5   cell 6
cell 4   cell 5   cell 6   cell 7
cell 5   cell 6   cell 7   cell 8
df_header = df.iloc[0]
column_number = []
i = 0
while i < len(df_header):
    # record the position of every header cell that contains 'X'
    for column_index in df_header:
        if 'X' in column_index:
            column_number.append(i)
        i += 1
df = df[df.columns[column_number]]
print(df)
X        X        X
column1  column2  column4
cell 1   cell 2   cell 4
cell 2   cell 3   cell 5
cell 3   cell 4   cell 6
cell 4   cell 5   cell 7
cell 5   cell 6   cell 8
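For comparison, a more compact sketch of the same idea (assuming, as above, that the row holding the 'X' markers is df.iloc[0]):

# record the position of every header cell that contains 'X', then keep those columns
df_header = df.iloc[0]
column_number = [i for i, cell in enumerate(df_header) if 'X' in cell]
df = df[df.columns[column_number]]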
You can use a MultiIndex:
Xs = ['X', 'X', None, 'X']
df.columns = pd.MultiIndex.from_arrays([Xs, df.columns])
or, from a list of positions:
pos = (0,1,3)
Xs = ['X' if i in pos else '' for i in range(len(df.columns))]
df.columns = pd.MultiIndex.from_arrays([Xs, df.columns])
output:
X X NaN X
column1 column2 column3 column4
0 cell 1 cell 2 cell 3 cell 4
1 cell 2 cell 3 cell 4 cell 5
2 cell 3 cell 4 cell 5 cell 6
3 cell 4 cell 5 cell 6 cell 7
4 cell 5 cell 6 cell 7 cell 8
input:
df = pd.DataFrame({'column1': ['cell 1', 'cell 2', 'cell 3', 'cell 4', 'cell 5'],
'column2': ['cell 2', 'cell 3', 'cell 4', 'cell 5', 'cell 6'],
'column3': ['cell 3', 'cell 4', 'cell 5', 'cell 6', 'cell 7'],
'column4': ['cell 4', 'cell 5', 'cell 6', 'cell 7', 'cell 8']})
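Once the columns carry the two-level index, the 'X' group can be pulled out directly; a small usage sketch:

# select every column whose first level is 'X'
df_x = df['X']                      # columns: column1, column2, column4
# equivalently: df.xs('X', axis=1, level=0)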
try it:
table = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
{'a': 100, 'b': 200, 'c': 300, 'd': 400},
{'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(table)
print("total size row: ", df.index.size)
for value in df.values:
    print("size col: ", value.size, "value:", value)
output:
total size row: 3
size col: 4 value: [1 2 3 4]
size col: 4 value: [100 200 300 400]
size col: 4 value: [1000 2000 3000 4000]
I have a list of lists:
list = [
    ['Row 1', 'Value 1'],
    ['Row 2', 'Value 2'],
    ['Row 3', 'Value 3', 'Value 4']
]
And I have a list for dataframe header:
header_list = ['RowID', 'Value']
If I create the DataFrame using df = pd.DataFrame(list, columns = header_list), then Python will throw an error saying Row 3 has more than 2 columns, which is inconsistent with the header_list.
So how can I skip Row 3 when creating the DataFrame? And how can I achieve this "in place", meaning WITHOUT creating a new list by looping through the original list and appending only the items with length 2?
Thanks for the help!
First, rename the variable list to L, because list is a Python built-in name.
Then, to filter, use a list comprehension:
L = [['Row 1','Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3', 'Value 4']]
# omit all rows whose length != 2
df = pd.DataFrame([x for x in L if len(x) == 2], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
# keep only the last 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[-2:] for x in L], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
2 Value 3 Value 4
Or:
# keep only the first 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[:2] for x in L], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
2 Row 3 Value 3
Try the code below:
list1 = [['Row 1', 'Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3']]
dff = pd.DataFrame(list1)
# keep only as many columns as there are header names, then rename them
dff = dff[[x for x in range(len(header_list))]]
dff.columns = header_list
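For reference, a quick check of what this produces (assuming header_list = ['RowID', 'Value'] as above):

print(dff)
#    RowID    Value
# 0  Row 1  Value 1
# 1  Row 2  Value 2
# 2  Row 3  Value 3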