delimit/split row values and form individual rows

delimit/split row values and form individual rows - python

reproducible code for data:
import pandas as pd
dict = {"a": "[1,2,3,4]", "b": "[1,2,3,4]"}
dict = pd.DataFrame(list(dict.items()))
dict
0 1
0 a [1,2,3,4]
1 b [1,2,3,4]
I wanted to split/delimit "column 1" and create individual rows for each split values.
expected output:
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
Should I be removing the brackets first and then split the values? I really don't get any idea of doing this. Any reference that would help me solve this please?

Based on the logic from that answer:
s = d[1]\
.apply(lambda x: pd.Series(eval(x)))\
.stack()
s.index = s.index.droplevel(-1)
s.name = "split"
d.join(s).drop(1, axis=1)

Because you have strings containing a list (and not lists) in your cells, you can use eval:
dict_v = {"a": "[1,2,3,4]", "b": "[1,2,3,4]"}
df = pd.DataFrame(list(dict_v.items()))
df = (df.rename(columns={0:'l'}).set_index('l')[1]
.apply(lambda x: pd.Series(eval(x))).stack()
.reset_index().drop('level_1',1).rename(columns={'l':0,0:1}))
or another way could be to create a DataFrame (probably faster) such as:
df = (pd.DataFrame(df[1].apply(eval).tolist(),index=df[0])
.stack().reset_index(level=1, drop=True)
.reset_index(name='1'))
your output is
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
all the rename are to get exactly your input/output

Related

Merge columns with more than one value in pandas dataframe

I've got this DataFrame in Python using pandas:
Column 1
Column 2
Column 3
hello
a,b,c
1,2,3
hi
b,c,a
4,5,6
The values in column 3 belong to the categories in column 2.
Is there a way to combine columns 2 and 3 that I get this output?
Column 1
a
b
c
hello
1
2
3
hi
6
4
5
Any advise will be very helpful! Thank you!

df.apply(lambda x: pd.Series(x['Column 3'].split(','), index=x['Column2'].split(',')), axis=1)
output:
a b c
0 1 2 3
1 4 5 6
result make to df1 and concat
df1 = df.apply(lambda x: pd.Series(x['Column 3'].split(','), index=x['Column2'].split(',')), axis=1)
pd.concat([df['Column 1'], df1], axis=1)
output:
col1 a b c
0 hello 1 2 3
1 hi 4 5 6

You can use pd.crosstab after exploding the commas:
new_df = ( df.assign(t=df['Column 2'].str.split(','), a=df['Column 3'].str.split(',')).
explode(['t', 'a']) )
output = ( pd.crosstab(index=new_df['Column 1'], columns=new_df['t'],
values=new_df['a'], aggfunc='sum').reset_index() )
Output:
t Column 1 a b c
0 hello 1 2 3
1 hi 4 5 6

Efficiency wise, I'd say do all the wrangling in vanilla python and create a new dataframe:
from collections import defaultdict
outcome = defaultdict(list)
for column, row in zip(df['Column 2'], df['Column 3']):
column = column.split(',')
row = row.split(',')
for first, last in zip(column, row):
outcome[first].append(last)
pd.DataFrame(outcome).assign(Column = df['Column 1'])
a b c Column
0 1 2 3 hello
1 6 4 5 hi

Pandas row value string parsing (mixed string and float) [duplicate]

I have data like this
ID INFO
1 A=2;B=2;C=5
2 A=3;B=4;C=1
3 A=1;B=3;C=2
I want to split the Info columns into
ID A B C
1 2 2 5
2 3 4 1
3 1 3 2
I can split columns with one delimiter by using
df['A'], df['B'], df['C'] = df['INFO'].str.split(';').str
then split again by = but this seems to not so efficient in case I have many rows and especially when there are so many field that cannot be hard-coded beforehand.
Any suggestion would be greatly welcome.

You could use named groups together with Series.str.extract. In the end concat back the 'ID'. This assumes you always have A=;B=;and C= in a line.
pd.concat([df['ID'],
df['INFO'].str.extract('A=(?P<A>\d);B=(?P<B>\d);C=(?P<C>\d)')], axis=1)
# ID A B C
#0 1 2 2 5
#1 2 3 4 1
#2 3 1 3 2
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2' then we can split on ';' and partition on '='. pivot in the end to get to your desired output.
### Starting Data
#ID INFO
#1 A=2;B=2;C=5
#2 A=3;B=4;C=1
#3 A=1;B=3;C=2
#4 A=1;C=2
(df.set_index('ID')['INFO']
.str.split(';', expand=True)
.stack()
.str.partition('=')
.reset_index(-1, drop=True)
.pivot(columns=0, values=2)
)
# A B C
#ID
#1 2 2 5
#2 3 4 1
#3 1 3 2
#4 1 NaN 2

Browsing a Series is much faster that iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict([x.split('=') for x in t.split(';')]) for t in df['INFO']], index=df['ID']).reset_index()
It gives as expected:
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
It should be faster than splitting twice dataframe columns.

values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)
This will give you the desired output:
ID INFO a b c
1 a=1;b=2;c=3 1 2 3
2 a=4;b=5;c=6 4 5 6
3 a=7;b=8;c=9 7 8 9
Explanation:
The first line converts every value to a dictionary.
e.g.
x = 'a=1;b=2;c=3'
dict(item.split("=") for item in x.split(";"))
results in :
{'a': '1', 'b': '2', 'c': '3'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['a', 'b', 'c']] = pd.DataFrame(values)

Another solution is Series.str.findAll to extract values and then apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop("INFO", 1)
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
[2, "A=3;B=4;C=1"],
[3, "A=1;B=3;C=2"]],
columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0 [2, 2, 5]
# 1 [3, 4, 1]
# 2 [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
# ID INFO A B C
# 0 1 A=2;B=2;C=5 2 2 5
# 1 2 A=3;B=4;C=1 3 4 1
# 2 3 A=1;B=3;C=2 1 3 2
# Remove INFO column
df = df.drop("INFO", 1)
print(df)
# ID A B C
# 0 1 2 2 5
# 1 2 3 4 1
# 2 3 1 3 2

Another solution :
#split on ';'
#explode
#then split on '='
#and pivot
df_INFO = (df.INFO
.str.split(';')
.explode()
.str.split('=',expand=True)
.pivot(columns=0,values=1)
)
pd.concat([df.ID,df_INFO],axis=1)
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2

Pandas Python : how to create multiple columns from a list

I have a list with columns to create :
new_cols = ['new_1', 'new_2', 'new_3']
I want to create these columns in a dataframe and fill them with zero :
df[new_cols] = 0
Get error :
"['new_1', 'new_2', 'new_3'] not in index"
which is true but unfortunate as I want to create them...
EDIT : This is a duplicate of this question : Add multiple empty columns to pandas DataFrame however I keep this one too because the accepted answer here was the simple solution I was looking for, and it was not he accepted answer out there
EDIT 2 : While the accepted answer is the most simple, interesting one-liner solutions were posted below

You need to add the columns one by one.
for col in new_cols:
df[col] = 0
Also see the answers in here for other methods.

Use assign by dictionary:
df = pd.DataFrame({
'A': ['a','a','a','a','b','b','b','c','d'],
'B': list(range(9))
})
print (df)
0 a 0
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 c 7
8 d 8
new_cols = ['new_1', 'new_2', 'new_3']
df = df.assign(**dict.fromkeys(new_cols, 0))
print (df)
A B new_1 new_2 new_3
0 a 0 0 0 0
1 a 1 0 0 0
2 a 2 0 0 0
3 a 3 0 0 0
4 b 4 0 0 0
5 b 5 0 0 0
6 b 6 0 0 0
7 c 7 0 0 0
8 d 8 0 0 0

import pandas as pd
new_cols = ['new_1', 'new_2', 'new_3']
df = pd.DataFrame.from_records([(0, 0, 0)], columns=new_cols)
Is this what you're looking for ?

You can use assign:
new_cols = ['new_1', 'new_2', 'new_3']
values = [0, 0, 0] # could be anything, also pd.Series
df = df.assign(**dict(zip(new_cols, values)

Try looping through the column names before creating the column:
for col in new_cols:
df[col] = 0

We can use the Apply function to loop through the columns in the dataframe and assigning each of the element to a new field
for instance for a list in a dataframe with a list named keys
[10,20,30]
In your case since its all 0 we can directly assign them as 0 instead of looping through. But if we have values we can populate them as below
...
df['new_01']=df['keys'].apply(lambda x: x[0])
df['new_02']=df['keys'].apply(lambda x: x[1])
df['new_03']=df['keys'].apply(lambda x: x[2])

Convert Python dict of Arrays into a dataframe

I have a dictionary of arrays like the following:
d = {'a': [1,2], 'b': [3,4], 'c': [5,6]}
I want to create a pandas dataframe like this:
0 1 2
0 a 1 2
1 b 3 4
2 c 5 6
I wrote the following code:
pd.DataFrame(list(d.items()))
which returns:
0 1
0 a [1,2]
1 b [3,4]
2 c [5,6]
Do you know how can I achieve my goal?!
Thank you in advance.

Pandas allows you to do this in a straightforward fashion:
pd.DataFrame.from_dict(d,orient = 'index')
>> 0 1
a 1 2
b 3 4
c 5 6
pd.DataFrame.from_dict(d,orient = 'index').reset_index() gives you what you are looking for.

Use the splat operator in a comprehension to produce your dataframe:
pd.DataFrame([k, *v] for k, v in d.items())
0 1 2
0 a 1 2
1 b 3 4
2 c 5 6
If you don't mind having index as one of your column names, simply transpose and reset_index:
pd.DataFrame(d).T.reset_index()
index 0 1
0 a 1 2
1 b 3 4
2 c 5 6
Finally, although it's rather ugly, the most performant option I could find on very large dictionaries is the following:
pd.DataFrame(list(d.values()), index=list(d.keys())).reset_index()

Adding a column in dataframes based on similar columns in them

I am trying to get an output where I wish to add column d in d1 and d2 where a b c are same (like groupby).
For example
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
a b c d
0 1 2 3 8
1 2 3 4 5
Merging the two data frames and adding the resultant column d where a b c are same.
d1.add(d2) or radd gives me an aggregate of all columns
The solution should be a DataFrame which can be added again to another similarly.
Any help is appreciated.

You can use set_index first:
print (d2.set_index(['a','b','c'])
.add(d1.set_index(['a','b','c']), fill_value=0)
.astype(int)
.reset_index())
a b c d
0 1 2 3 8
1 2 3 4 5

df = pd.concat([d1, d2])
df.drop_duplicates()
a b c d
0 1 2 3 4
1 2 3 4 5

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

delimit/split row values and form individual rows - python

Based on the logic from that answer: s = d[1]\ .apply(lambda x: pd.Series(eval(x)))\ .stack() s.index = s.index.droplevel(-1) s.name = "split" d.join(s).drop(1, axis=1)

Related

Merge columns with more than one value in pandas dataframe

Pandas row value string parsing (mixed string and float) [duplicate]

Pandas Python : how to create multiple columns from a list

Convert Python dict of Arrays into a dataframe

Adding a column in dataframes based on similar columns in them

Categories

Resources