Reshaping strings-as-lists into rows - python

I have a pandas data frame like this:
df = pandas.DataFrame({
'Grouping': ["A", "B", "C"],
'Elements': ['[\"A1\"]', '[\"B1\", \"B2\", \"B3\"]', '[\"C1\", \"C2\"]']
}).set_index('Grouping')
so
Elements
Grouping
===============================
A ["A1"]
B ["B1", "B2", "B3"]
C ["C1", "C2"]
i.e. some of the lists are encoded as strings. What is a clean way to reshape this into a tidy data set like this:
Elements
Grouping
====================
A A1
B B1
B B2
B B3
C C1
C C2
without resorting to a for-loop? The best I can come up with is:
df1 = pandas.DataFrame()
for index, row in df.iterrows():
    df_temp = pandas.DataFrame({'Elements': row['Elements'].replace("[\"", "").replace("\"]", "").split('\", \"')})
    df_temp['Grouping'] = index
    df1 = pandas.concat([df1, df_temp])
df1.set_index('Grouping', inplace=True)
but that's pretty ugly.

You can use .str.extractall():
df.Elements.str.extractall(r'"(.+?)"').reset_index(level="match", drop=True).rename({0:"Elements"}, axis=1)
the result:
Elements
Grouping
A A1
B B1
B B2
B B3
C C1
C C2

You can convert each string to an actual list with ast.literal_eval, then apply pd.Series and stack:
import ast
import pandas as pd
df.Elements = df.Elements.apply(ast.literal_eval)
df.Elements.apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('Elements')
Elements
Grouping
A A1
B B1
B B2
B B3
C C1
C C2
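On pandas 0.25+ there is also Series.explode. Since the encoded strings happen to be valid JSON, a sketch of that route, which keeps the Grouping index aligned automatically:

```python
import json
import pandas as pd

df = pd.DataFrame({
    'Grouping': ['A', 'B', 'C'],
    'Elements': ['["A1"]', '["B1", "B2", "B3"]', '["C1", "C2"]']
}).set_index('Grouping')

# Parse each JSON-style string into a real list, then emit one row per element
tidy = df['Elements'].apply(json.loads).explode().to_frame('Elements')
```

The index labels are repeated per element, giving the same tidy shape as the other answers.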


Select all the subcolumns with a given name from pandas dataframe

I have a pandas dataframe df that I built using 3 levels of columns, as follows:
a1 a2 a3
b1 b2 b1 b3 b1 b4
c1 c2 c1 c2 c1 c2 c1 c2 c1 c2 c1 c2
... (data) ...
Note that each a column may have different b subcolumns, but each b column has the same c subcolumns.
I can extract e.g. the subcolumns from a2 using df["a2"].
How can I select based on the second or third level without having to specify the first and second level respectively? For instance I would like to say "give me all the c2 columns you can find" and I would get:
a1 a2 a3
b1 b2 b1 b3 b1 b4
... (data for the c2 columns) ...
Or "give me all the b1 columns" and I would get:
a1 a2 a3
c1 c2 c1 c2 c1 c2
... (data for the b1 columns) ...
The docs provide some info on that. Adapting the examples from there to your case, you can either use tuples of slice objects (passing None to select everything at a level),
df.loc[:, (slice(None), slice(None), "c2")]
or use pd.IndexSlice to get the familiar : notation:
idx = pd.IndexSlice
df.loc[:, idx[:, :, "c2"]]
When a DataFrame has MultiIndex columns, its columns are a list of tuples containing the keys of each level.
You can get the columns you want with a list comprehension that checks whether the name you want is in the tuple:
# get all the "c2"
df[[col for col in df.columns if "c2" in col]]
# get all the "b1"
df[[col for col in df.columns if "b1" in col]]
If you want to be sure that the column name is at the right level, compare by position:
# get all the "c2"
df[[col for col in df.columns if col[2] == "c2"]]
# get all the "b1"
df[[col for col in df.columns if col[1] == "b1"]]
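Putting the IndexSlice approach together as a self-contained sketch, with hypothetical data mirroring the question's layout (each a column has different b children, every b has c1/c2):

```python
import numpy as np
import pandas as pd

# Hypothetical 3-level columns in the shape the question describes
cols = pd.MultiIndex.from_tuples([
    ('a1', 'b1', 'c1'), ('a1', 'b1', 'c2'),
    ('a1', 'b2', 'c1'), ('a1', 'b2', 'c2'),
    ('a2', 'b1', 'c1'), ('a2', 'b1', 'c2'),
])
df = pd.DataFrame(np.arange(12).reshape(2, 6), columns=cols)

idx = pd.IndexSlice
c2_cols = df.loc[:, idx[:, :, 'c2']]  # every "c2" leaf column
b1_cols = df.loc[:, idx[:, 'b1', :]]  # every "b1" middle-level column
```

Note that slicing a MultiIndex like this works best when the columns are lexsorted; on an unsorted frame, call df.sort_index(axis=1) first.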

Combine Two Lists in python in dataframe with file name added in one column and content in another

I have a list of files in a folder in my system
file_list= ["A", "B", "C"]
I Have read the files using a for loop and have obtained content which is as follows
A = ["A1", "B1", "C1"]
B = ["E1", "F1"]
C = []
I would like the following output
Content Name
A1 A
B1 A
C1 A
E1 B
F1 B
C
How do I accomplish this?
Try this
import pandas as pd
data = list(zip((A, B, C), file_list))
df = pd.DataFrame(data, columns=['Content', 'Name'])
df = df.explode('Content')
print(df)
Output:
Content Name
0 A1 A
0 B1 A
0 C1 A
1 E1 B
1 F1 B
2 NaN C
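If the file with no content should be dropped rather than kept as a NaN row, a small follow-up on the same sketch (reusing the question's lists) does it:

```python
import pandas as pd

file_list = ['A', 'B', 'C']
A, B, C = ['A1', 'B1', 'C1'], ['E1', 'F1'], []

df = pd.DataFrame(list(zip((A, B, C), file_list)), columns=['Content', 'Name'])
df = df.explode('Content')

# Drop the empty file's NaN row and restore a clean 0..n-1 index
df = df.dropna(subset=['Content']).reset_index(drop=True)
```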

Selecting rows from a pandas dataframe based on the values of some columns in a list of the same dataframe?

Let's suppose, there is a dataframe :
df1 =
A B C
0 1 a a1
1 2 b b2
2 3 c c3
3 4 d d4
4 5 e e5
5 6 f f6
Created as :
a1 = [1,2,3,4,5,6]
a2 = ['a','b','c','d','e','f']
a3 = ['a1','b2','c3','d4','e5','f6']
df1 = pd.DataFrame(list(zip(a1,a2,a3)),columns=["A","B","C"])
Here, I am considering Columns A and B to be something like primary keys for this dataframe.
So, PK = ["A","B"].
I have another list, list1 = [[2,'b'],[5,'e']], which is a subset of the dataframe df[PK].
Is there any way I can get the rows corresponding to these primary key values inside the list from the dataframe df?
Something like :
df1 = df[df[PK].values.isin(list1)] which doesn't work as I expect.
I would like to get an output df1 as :
df1 =
A B C
1 2 b b2
4 5 e e5
There are some similar questions, which I have gone through in this portal. But none of them showed me how to select rows based on filter on multiple columns as mentioned above.
Thanks in advance.
Here is how you can use pandas.DataFrame.merge():
import pandas as pd
a1 = [1,2,3,4,5,6]
a2 = ['a','b','c','d','e','f']
a3 = ['a1','b2','c3','d4','e5','f6']
df1 = pd.DataFrame(list(zip(a1,a2,a3)),columns=["A","B","C"])
PK = ["A","B"]
list1 = [[2,'b'],[5,'e']]
df2 = df1.merge(pd.DataFrame(list1,columns=PK),on=PK)
print(df2)
Output:
A B C
0 2 b b2
1 5 e e5
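An alternative sketch that keeps the original row index (the merge result renumbers from 0) is to turn each (A, B) pair into a tuple and test membership with isin:

```python
import pandas as pd

a1 = [1, 2, 3, 4, 5, 6]
a2 = ['a', 'b', 'c', 'd', 'e', 'f']
a3 = ['a1', 'b2', 'c3', 'd4', 'e5', 'f6']
df = pd.DataFrame(list(zip(a1, a2, a3)), columns=['A', 'B', 'C'])

PK = ['A', 'B']
list1 = [[2, 'b'], [5, 'e']]

# Tuple-ise each primary-key row, then test against the key list
mask = df[PK].apply(tuple, axis=1).isin([tuple(k) for k in list1])
result = df[mask]
```

The surviving rows keep their original labels 1 and 4, matching the desired output.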

How to split a column, which contains a list, into different entries in Dataframe? [duplicate]

This question already has answers here:
Flatten a column with value of type list while duplicating the other column's value accordingly in Pandas
(8 answers)
Closed 5 years ago.
I have below data frame
A B C
1 A1 B1 [C1, C2]
2 A2 B2 [C3, C4]
I wish to transform it to
A B C
1 A1 B1 C1
2 A1 B1 C2
3 A2 B2 C3
4 A2 B2 C4
What should I do? Thanks
One really simple way of doing it (though it assumes every list in C has exactly two elements) is as follows:
import pandas as pd
df = pd.DataFrame([['A1', 'B1', ['C1', 'C2']],['A2', 'B2', ['C3', 'C4']]], columns = ['A', 'B', 'C'])
df1 = df.copy()
df1['C'] = df['C'].apply(lambda x: x[0])
df2 = df.copy()
df2['C'] = df['C'].apply(lambda x: x[1])
pd.concat([df1, df2]).sort_values('A')
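On pandas 0.25+, DataFrame.explode does this directly and handles lists of any length, not just pairs:

```python
import pandas as pd

df = pd.DataFrame([['A1', 'B1', ['C1', 'C2']],
                   ['A2', 'B2', ['C3', 'C4']]],
                  columns=['A', 'B', 'C'])

# One row per list element; A and B are duplicated accordingly
out = df.explode('C').reset_index(drop=True)
```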

Fastest way to filter a pandas dataframe on multiple columns

I have a pandas dataframe with several columns that labels data in a final column, for example,
df = pd.DataFrame( {'1_label' : ['a1','b1','c1','d1'],
'2_label' : ['a2','b2','c2','d2'],
'3_label' : ['a3','b3','c3','d3'],
'data' : [1,2,3,4]})
df = 1_label 2_label 3_label data
0 a1 a2 a3 1
1 b1 b2 b3 2
2 c1 c2 c3 3
3 d1 d2 d3 4
and a list of tuples,
list_t = [('a1','a2','a3'), ('d1','d2','d3')]
I want to filter this dataframe and return a new dataframe containing only the rows that correspond to the tuples in my list.
result = 1_label 2_label 3_label data
0 a1 a2 a3 1
1 d1 d2 d3 4
My naive (and C++ inspired) solution was to use append (like vector::push_back)
for l1, l2, l3 in list_t:
    if df[(df['1_label'] == l1) &
          (df['2_label'] == l2) &
          (df['3_label'] == l3)].empty is False:
        result = result.append(df[(df['1_label'] == l1) &
                                  (df['2_label'] == l2) &
                                  (df['3_label'] == l3)])
While my solution works I suspect it is horrendously slow for large dataframes and large list of tuples as I think pandas creates a new dataframe upon each call to append. Could anyone suggest a faster/cleaner way to do this? Thanks!
Assuming no duplicates, you could create index out of the columns you want to "filter" on:
In [10]: df
Out[10]:
1_label 2_label 3_label data
0 a1 a2 a3 1
1 b1 b2 b3 2
2 c1 c2 c3 3
3 d1 d2 d3 4
In [11]: df.set_index(['1_label', '2_label', '3_label'])\
.loc[[('a1','a2','a3'), ('d1','d2','d3')]]\
.reset_index()
Out[11]:
1_label 2_label 3_label data
0 a1 a2 a3 1
1 d1 d2 d3 4
If I understood correctly, merge should do the job:
pd.DataFrame(list_t, columns=['1_label', '2_label', '3_label']).merge(df)
Out[73]:
1_label 2_label 3_label data
0 a1 a2 a3 1
1 d1 d2 d3 4
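Since the question asks for the fastest way, a third sketch worth considering: build a MultiIndex from the label columns and do a single vectorised membership test, avoiding both the per-tuple loop and the merge machinery:

```python
import pandas as pd

df = pd.DataFrame({'1_label': ['a1', 'b1', 'c1', 'd1'],
                   '2_label': ['a2', 'b2', 'c2', 'd2'],
                   '3_label': ['a3', 'b3', 'c3', 'd3'],
                   'data': [1, 2, 3, 4]})
list_t = [('a1', 'a2', 'a3'), ('d1', 'd2', 'd3')]

# One boolean mask over all rows at once
keys = pd.MultiIndex.from_frame(df[['1_label', '2_label', '3_label']])
result = df[keys.isin(list_t)].reset_index(drop=True)
```

MultiIndex.from_frame requires pandas 0.24+.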
