In the following code, I have defined a dictionary and then converted it to a dataframe
my_dict = {
'A' : [1,2],
'B' : [4,5,6]
}
df = pd.DataFrame()
df = df.append(my_dict, ignore_index=True)
The output is a [1 rows x 2 columns] dataframe that looks like this:
A B
0 [1,2] [4,5,6]
However, I would like to reshape it as
A B
0 1 4
1 2 5
2 6
How can I fix the code for that purpose?
You might use pandas.Series.explode as follows:
import pandas as pd
my_dict = {
'A' : [1,2],
'B' : [4,5,6]
}
df = pd.DataFrame()
df = df.append(my_dict, ignore_index=True)
df = df.apply(lambda x:x.explode(ignore_index=True))
print(df)
output
A B
0 1 4
1 2 5
2 NaN 6
I apply explode to each column with ignore_index=True, which prevents duplicate indices.
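Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; here is a minimal sketch of the same idea without append, assuming a reasonably recent pandas (Series.explode gained ignore_index in 1.1):
import pandas as pd
my_dict = {
    'A': [1, 2],
    'B': [4, 5, 6]
}
# Build a one-row frame of lists directly, then explode each column;
# the shorter column is padded with NaN when the results are realigned
df = pd.DataFrame({k: [v] for k, v in my_dict.items()})
df = df.apply(lambda x: x.explode(ignore_index=True))
print(df)
#      A  B
# 0    1  4
# 1    2  5
# 2  NaN  6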
Another way of doing this is:
df = pd.DataFrame.from_dict(my_dict, orient='index').T
print(df)
Output:
A B
0 1.0 4.0
1 2.0 5.0
2 NaN 6.0
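The floats appear because NaN forces the columns to a float dtype. If you'd rather keep integers, one option (assuming pandas >= 0.24 for the nullable Int64 dtype) is:
# Nullable integer dtype keeps whole numbers while still allowing missing values
df = pd.DataFrame.from_dict(my_dict, orient='index').T.astype('Int64')
print(df)
#       A  B
# 0     1  4
# 1     2  5
# 2  <NA>  6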
This will give you the results you are looking for, if you don't mind changing your code a little:
my_dict = {
'A' : [1,2,''],
'B' : [4,5,6]
}
df = pd.DataFrame(my_dict)
df
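If you'd rather not pad the lists by hand, a hedged alternative is to pad them programmatically with itertools.zip_longest, which fills the shorter lists with None (shown as NaN by pandas):
import pandas as pd
from itertools import zip_longest
my_dict = {
    'A': [1, 2],
    'B': [4, 5, 6]
}
# zip_longest pads the shorter column with None
padded = list(zip_longest(*my_dict.values()))
df = pd.DataFrame(padded, columns=my_dict.keys())
print(df)
#      A  B
# 0  1.0  4
# 1  2.0  5
# 2  NaN  6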
Try this instead. You need to pass the dictionary to the DataFrame constructor. I've run it, and it gives the output you want. Don't use append here; it's for appending one dataframe to another.
import pandas as pd
my_dict = {
'A' : [1,2,''],
'B' : [4,5,6]
}
df = pd.DataFrame(data=my_dict)
#df = df.append(my_dict, ignore_index=True)
print(df)
Related
I'm having trouble adding a new column to a pandas dataframe when the new column has more values than the dataframe's index has entries.
The data looks like this:
import pandas as pd
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
So, as you can see, the length of this df's index is 3.
Next I want to add a new column, which I've tried in the two ways below:
df["new_col"] = [1,2,3,4]
This raises an error: Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
I just get the values [1,2,3] in my dataframe df.
(The new column's values are cut off beyond the existing maximum index.)
What I want looks like this:
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Is there a better way? Looking forward to your answers, thanks!
Use DataFrame.join with a renamed Series and a right join:
#if not default index
#df = df.reset_index(drop=True)
df = df.join(pd.Series([1,2,3,4]).rename('new_col'), how='right')
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Another idea is to reindex by the new s.index:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index)
df["new_col"] = s
print (df)
bar zoo new_col
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
Or the same idea in one chained expression:
s = pd.Series([1,2,3,4])
df = df.reindex(s.index).assign(new_col = s)
Another option is pd.concat, which aligns on the index:
df = pd.DataFrame(
{
"bar": ["A","B","C"],
"zoo": [1,2,3],
})
new_col = pd.Series([1,2,3,4])
df = pd.concat([df,new_col],axis=1)
print(df)
bar zoo 0
0 A 1.0 1
1 B 2.0 2
2 C 3.0 3
3 NaN NaN 4
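Note that the concatenated Series keeps its default name, 0, which is why the last column is labeled 0 above. A small tweak, naming the Series first so the column comes out labeled new_col:
import pandas as pd
df = pd.DataFrame({"bar": ["A", "B", "C"], "zoo": [1, 2, 3]})
# Name the Series before concatenating so the new column gets a proper label
new_col = pd.Series([1, 2, 3, 4], name="new_col")
df = pd.concat([df, new_col], axis=1)
print(df)
#    bar  zoo  new_col
# 0    A  1.0        1
# 1    B  2.0        2
# 2    C  3.0        3
# 3  NaN  NaN        4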
I have a dataframe with a dictionary inside a nested list, and I want to split the column 'C':
A B C
1 a [ {"id":2,"Col":{"x":3,"y":4}}]
2 b [ {"id":5,"Col":{"x":6,"y":7}}]
expected output :
A B C_id Col_x Col_y
1 a 2 3 4
2 b 5 6 7
From the comments, json_normalize might help you.
First extract the Col and id columns with:
df[["Col", "id"]] = df["C"].apply(lambda x: pd.Series(x[0]))
Then you can flatten the dictionary in Col with json_normalize and use concat to merge it with the existing dataframe:
df = pd.concat([df, json_normalize(df.Col)], axis=1)
Also, use drop to remove old columns.
Full code:
# Import modules
import pandas as pd
from pandas.io.json import json_normalize  # deprecated in pandas 1.0; prefer pd.json_normalize
# Create dataframe
df = pd.DataFrame([[1, "a", [ {"id":2,"Col":{"x":3,"y":4}}]],
[2, "b", [ {"id":5,"Col":{"x":6,"y":7}}]]],
columns=["A", "B", "C"])
# Add col and id column + remove old "C" column
df = pd.concat([df, df["C"].apply(lambda x: pd.Series(x[0]))], axis=1) \
.drop("C", axis=1)
print(df)
# A B Col id
# 0 1 a {'x': 3, 'y': 4} 2
# 1 2 b {'x': 6, 'y': 7} 5
# Show json_normalize behavior
print(json_normalize(df.Col))
# x y
# 0 3 4
# 1 6 7
# Flatten dict in "Col" column + remove "Col" column
df = pd.concat([df, json_normalize(df.Col)], axis=1) \
.drop(["Col"], axis=1)
print(df)
# A B id x y
# 0 1 a 2 3 4
# 1 2 b 5 6 7
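With a modern pandas (>= 1.0 for pd.json_normalize, >= 0.25 for .str[0] on lists), the whole thing compresses to a short sketch; the rename to C_id is just to match the expected output:
import pandas as pd
df = pd.DataFrame([[1, "a", [{"id": 2, "Col": {"x": 3, "y": 4}}]],
                   [2, "b", [{"id": 5, "Col": {"x": 6, "y": 7}}]]],
                  columns=["A", "B", "C"])
# Take the single dict out of each list, flatten the nested "Col" dict
flat = pd.json_normalize(df["C"].str[0].tolist(), sep="_").rename(columns={"id": "C_id"})
df = pd.concat([df.drop(columns="C"), flat], axis=1)
print(df)
#    A  B  C_id  Col_x  Col_y
# 0  1  a     2      3      4
# 1  2  b     5      6      7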
You can try the .apply method:
df['C_id'] = df['C'].apply(lambda x: x[0]['id'])
df['C_x'] = df['C'].apply(lambda x: x[0]['Col']['x'])
df['C_y'] = df['C'].apply(lambda x: x[0]['Col']['y'])
Code
import pandas as pd
A = [1, 2]
B = ['a', 'b']
C = [{"id":2,"Col":{"x":3,"y":4}}, {"id":5,"Col":{"x":6,"y":7}}]
df = pd.DataFrame({"A": A, "B": B, "C_id": [element["id"] for element in C],
"Col_x": [element["Col"]["x"] for element in C],
"Col_y": [element["Col"]["y"] for element in C]})
Output:
   A  B  C_id  Col_x  Col_y
0  1  a     2      3      4
1  2  b     5      6      7
This question already has answers here:
How to explode a list inside a Dataframe cell into separate rows
(12 answers)
Closed 3 years ago.
I have a dataframe in pandas like this:
id info
1 [1,2]
2 [3]
3 []
And I want to split it into different rows like this:
id info
1 1
1 2
2 3
3 NaN
How can I do this?
You can try this out:
>>> import pandas as pd
>>> df = pd.DataFrame({'id': [1,2,3], 'info': [[1,2],[3],[]]})
>>> s = df.apply(lambda x: pd.Series(x['info']), axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'info'
>>> df2 = df.drop('info', axis=1).join(s)
>>> df2['info'] = pd.Series(df2['info'], dtype=object)
>>> df2
id info
0 1 1
0 1 2
1 2 3
2 3 NaN
A similar question is posted here.
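Since pandas 0.25 there is also a built-in DataFrame.explode that handles this directly, including turning empty lists into NaN; a minimal sketch:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3], 'info': [[1, 2], [3], []]})
# explode repeats the id for each list element; empty lists become NaN
print(df.explode('info'))
#    id info
# 0   1    1
# 0   1    2
# 1   2    3
# 2   3  NaN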
This is a rather convoluted way, which drops empty cells:
import pandas as pd
df = pd.DataFrame({'id': [1,2,3],
'info': [[1,2], [3], [ ]]})
unstack_df = df.set_index(['id'])['info'].apply(pd.Series)\
.stack()\
.reset_index(level=1, drop=True)
unstack_df = unstack_df.reset_index()
unstack_df.columns = ['id', 'info']
unstack_df
>>
id info
0 1 1.0
1 1 2.0
2 2 3.0
Here's one way using np.repeat and itertools.chain. Converting empty lists to {np.nan} is a trick to fool Pandas into accepting an iterable as a value. This allows chain.from_iterable to work error-free.
import numpy as np
from itertools import chain
df.loc[~df['info'].apply(bool), 'info'] = {np.nan}
res = pd.DataFrame({'id': np.repeat(df['id'], df['info'].map(len).values),
'info': list(chain.from_iterable(df['info']))})
print(res)
id info
0 1 1.0
0 1 2.0
1 2 3.0
2 3 NaN
Try these methods too...
Method 1
def split_dataframe_rows(df, column_selectors):
    # we need to keep track of the ordering of the columns
    def _split_list_to_rows(row, row_accumulator, column_selectors):
        split_rows = {}
        max_split = 0
        for column_selector in column_selectors:
            split_row = row[column_selector]
            split_rows[column_selector] = split_row
            if len(split_row) > max_split:
                max_split = len(split_row)
        for i in range(max_split):
            new_row = row.to_dict()
            for column_selector in column_selectors:
                try:
                    new_row[column_selector] = split_rows[column_selector].pop(0)
                except IndexError:
                    new_row[column_selector] = ''
            row_accumulator.append(new_row)

    new_rows = []
    df.apply(_split_list_to_rows, axis=1, args=(new_rows, column_selectors))
    new_df = pd.DataFrame(new_rows, columns=df.columns)
    return new_df
Method 2
def flatten_data(json=None):
    # requires: from pandas import json_normalize (pd.json_normalize in pandas >= 1.0)
    df = pd.DataFrame(json)
    list_cols = [col for col in df.columns if type(df.loc[0, col]) == list]
    for i in range(len(list_cols)):
        col = list_cols[i]
        meta_cols = [c for c in df.columns if type(df.loc[0, c]) != list] + list_cols[i+1:]
        json_data = df.to_dict('records')
        df = json_normalize(data=json_data, record_path=col, meta=meta_cols,
                            record_prefix=col + '_', sep='_')
    return json_normalize(df.to_dict('records'))
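For instance, Method 1 applied to the frame from this question would look like the hedged sketch below. Note that a row whose selected list is empty produces no output rows, and that pop(0) empties the original lists in place:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3], 'info': [[1, 2], [3], []]})
print(split_dataframe_rows(df, ['info']))
#    id info
# 0   1    1
# 1   1    2
# 2   2    3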
I want to initialize a 31756x2 data frame of strings.
I want it to look like this:
index column1 column2
0 A B
1 A B
.
.
31756 A B
I wrote:
content_split = [["A", "B"] for x in range(31756)]
This gives me a two-dimensional list, but I want the columns to be separated as in a data frame (column1: A..., column2: B...), and I can't seem to get it to work.
Would love some help.
Use the DataFrame constructor directly:
df = pd.DataFrame([["A", "B"] for x in range(31756)], columns=['col1','col2'])
print (df.head())
col1 col2
0 A B
1 A B
2 A B
3 A B
4 A B
Or:
N = 31756
df = pd.DataFrame({'col1':['A'] * N, 'col2':['B'] * N})
print (df.head())
col1 col2
0 A B
1 A B
2 A B
3 A B
4 A B
import pandas as pd
df = pd.DataFrame(index=range(31756))
df.loc[:,'column1'] = 'A'
df.loc[:,'column2'] = 'B'
Using numpy.tile:
import numpy as np
df = pd.DataFrame(np.tile(list('AB'), (31756, 1)), columns=['col1','col2'])
Or just pass a dictionary:
df = pd.DataFrame({'A':['A']*31756, 'B':['B']*31756})
If you're using this latter method on Python < 3.7, you may want to explicitly sort the columns, since dictionaries there don't guarantee insertion order:
df = pd.DataFrame({'A':['A']*31756, 'B':['B']*31756}).sort_index(axis=1)
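Scalar values also broadcast across an explicit index in the DataFrame constructor, so you can skip building the lists entirely:
# Each scalar is repeated for every row of the given index
df = pd.DataFrame({'col1': 'A', 'col2': 'B'}, index=range(31756))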
For fun:
pd.DataFrame(index=range(31756)).assign(**dict(col1='A', col2='B'))
I have a valid json file with the following format that I am trying to load into pandas.
{
"testvalues": [
[1424754000000, 0.7413],
[1424840400000, 0.7375],
[1424926800000, 0.7344],
[1425013200000, 0.7375],
[1425272400000, 0.7422],
[1425358800000, 0.7427]
]
}
There is a pandas function called read_json() that takes in json files/buffers and spits out a dataframe, but I have not been able to get it to load correctly, i.e., as two columns rather than a single column whose elements look like [1424754000000, 0.7413]. I have tried different 'orient' and 'typ' values to no avail. What options should I pass into the function to get a two-column dataframe corresponding to the timestamp and the value?
You can use a list comprehension with the DataFrame constructor:
import pandas as pd
df = pd.read_json('file.json')
print(df)
testvalues
0 [1424754000000, 0.7413]
1 [1424840400000, 0.7375]
2 [1424926800000, 0.7344]
3 [1425013200000, 0.7375]
4 [1425272400000, 0.7422]
5 [1425358800000, 0.7427]
print(pd.DataFrame([x for x in df['testvalues']], columns=['a','b']))
a b
0 1424754000000 0.7413
1 1424840400000 0.7375
2 1424926800000 0.7344
3 1425013200000 0.7375
4 1425272400000 0.7422
5 1425358800000 0.7427
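Those first-column values look like epoch milliseconds; if that's the case, a hedged follow-up converts them to proper timestamps:
res = pd.DataFrame([x for x in df['testvalues']], columns=['a', 'b'])
# interpret column 'a' as milliseconds since the Unix epoch
res['a'] = pd.to_datetime(res['a'], unit='ms')
print(res)
#                     a       b
# 0 2015-02-24 05:00:00  0.7413
# 1 2015-02-25 05:00:00  0.7375
# ...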
I'm not sure about pandas read_json, but if I understand correctly you could do this with astype(str), str.strip, and str.split:
d = {
"testvalues": [
[1424754000000, 0.7413],
[1424840400000, 0.7375],
[1424926800000, 0.7344],
[1425013200000, 0.7375],
[1425272400000, 0.7422],
[1425358800000, 0.7427]
]
}
df = pd.DataFrame(d)
res = df.testvalues.astype(str).str.strip('[]').str.split(', ', expand=True)
In [112]: df
Out[112]:
testvalues
0 [1424754000000, 0.7413]
1 [1424840400000, 0.7375]
2 [1424926800000, 0.7344]
3 [1425013200000, 0.7375]
4 [1425272400000, 0.7422]
5 [1425358800000, 0.7427]
In [113]: res
Out[113]:
0 1
0 1424754000000 0.7413
1 1424840400000 0.7375
2 1424926800000 0.7344
3 1425013200000 0.7375
4 1425272400000 0.7422
5 1425358800000 0.7427
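One caveat with this string-splitting approach: both resulting columns come out as strings, not numbers. A small follow-up converts them back:
# parse each column back to a numeric dtype
res = res.apply(pd.to_numeric)
print(res.dtypes)
# 0      int64
# 1    float64
# dtype: object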
You can apply a function that splits it into a pd.Series.
Say you start with
df = pd.read_json(s)
Then just apply a splitting function:
>>> df.apply(
lambda r: pd.Series({'l': r[0][0], 'r': r[0][1]}),
axis=1)
l r
0 1.424754e+12 0.7413
1 1.424840e+12 0.7375
2 1.424927e+12 0.7344
3 1.425013e+12 0.7375
4 1.425272e+12 0.7422
5 1.425359e+12 0.7427