I found many similarly titled questions, but could not find the exact one I am looking for.
I have a datafile like this:
title1:A1
title2:A2
title3:A3
title4:A4
title5:A5
title1:B1
title2:B2
title3:B3
title4:B4
title5:B5
title1:C1
title2:C2
title3:C3
title4:C4
title5:C5
title1:D1
title2:D2
title3:D3
title4:D4
title5:D5
Using pandas I would like to get a table like this:
title1 title2 title3 title4 title5
0 A1 A2 A3 A4 A5
1 B1 B2 B3 B4 B5
2 C1 C2 C3 C4 C5
3 D1 D2 D3 D4 D5
My attempt:
import pandas as pd
import numpy as np
df = pd.read_csv('colon_sep.txt',header=None,sep=':')
df.columns = ['title','id']
# for loop method
df2 = pd.DataFrame()
for t in df.title.unique():
    df2[t] = df[df.title == t]['id'].values
df2
# How can I get this with a more advanced method?
I was able to get the required table using a for loop.
Is there a better way using groupby or some other advanced method?
You can simplify your code a bit by numbering the rows with groupby.cumcount and finishing with a pivot call:
df = pd.read_csv('colon_sep.txt', sep=':', header=None)
# number the occurrences of each title to serve as the new row index
df.insert(2, 2, df.groupby(0).cumcount())
# titles become columns, the occurrence number becomes the index
df = df.pivot(index=2, columns=0, values=1)
print(df)
0 title1 title2 title3 title4 title5
2
0 A1 A2 A3 A4 A5
1 B1 B2 B3 B4 B5
2 C1 C2 C3 C4 C5
3 D1 D2 D3 D4 D5
After you do
df = pd.read_csv('colon_sep.txt',header=None,sep=':')
You can do
df2 = pd.DataFrame({name: group[1].tolist() for name, group in df.groupby(0)})
Or, if you have the data in a string called text, with the blocks separated by blank lines, you can do
df = pd.DataFrame([[line.split(':')[1] for line in block.split('\n')] for block in text.split('\n\n')])
You can get the column names with
df.columns = [line.split(':')[0] for line in text.split('\n\n')[0].split('\n')]
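For example, with a small hypothetical text whose blocks are separated by blank lines:
text = 'title1:A1\ntitle2:A2\n\ntitle1:B1\ntitle2:B2'
df = pd.DataFrame([[line.split(':')[1] for line in block.split('\n')] for block in text.split('\n\n')])
df.columns = [line.split(':')[0] for line in text.split('\n\n')[0].split('\n')]
print(df)
#   title1 title2
# 0     A1     A2
# 1     B1     B2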
Related
I have a pandas dataframe df that I built using 3 levels of columns, as follows:
a1 a2 a3
b1 b2 b1 b3 b1 b4
c1 c2 c1 c2 c1 c2 c1 c2 c1 c2 c1 c2
... (data) ...
Note that each a column may have different b subcolumns, but each b column has the same c subcolumns.
I can extract e.g. the subcolumns from a2 using df["a2"].
How can I select based on the second or third level without having to specify the first and second level respectively? For instance I would like to say "give me all the c2 columns you can find" and I would get:
a1 a2 a3
b1 b2 b1 b3 b1 b4
... (data for the c2 columns) ...
Or "give me all the b1 columns" and I would get:
a1 a2 a3
c1 c2 c1 c2 c1 c2
... (data for the b1 columns) ...
The docs provide some info on this. Adapting the examples from there to your case, either use tuples of slice objects, passing slice(None) for the levels you don't want to restrict,
df.loc[:, (slice(None), slice(None), "c2")]
or use pd.IndexSlice to get the familiar : notation:
idx = pd.IndexSlice
df.loc[:, idx[:, :, "c2"]]
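For a runnable illustration, here is a minimal sketch (the column levels mirror the question; the numbers are made up):
import pandas as pd
import numpy as np

cols = pd.MultiIndex.from_tuples([('a1', 'b1', 'c1'), ('a1', 'b1', 'c2'),
                                  ('a1', 'b2', 'c1'), ('a1', 'b2', 'c2'),
                                  ('a2', 'b1', 'c1'), ('a2', 'b1', 'c2')])
df = pd.DataFrame(np.arange(12).reshape(2, 6), columns=cols)

idx = pd.IndexSlice
print(df.loc[:, idx[:, :, 'c2']])                   # every c2 column
print(df.loc[:, (slice(None), 'b1', slice(None))])  # every b1 column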
When you have a dataframe with MultiIndex columns, each column label is a tuple containing the key for each level.
You can get the columns you want with a list comprehension that checks whether the name you want is in the tuple:
# get all the "c2"
df[[col for col in df.columns if "c2" in col]]
# get all the "b1"
df[[col for col in df.columns if "b1" in col]]
If you want to be sure that the column name is in the right position:
# get all the "c2"
df[[col for col in df.columns if col[2] == "c2"]]
# get all the "b1"
df[[col for col in df.columns if col[1] == "b1"]]
I have data that looks like this:
If I name the entries in each row a1 a2 b1 b2 c1 c2 d1 d2, the rule is: take one value from each of the groups A, B, C, D, where the two values within a group can swap positions. I need to take sets of 4 numbers, so I will have:
a1 b1 c1 d1
a2 b1 c1 d1
a1 b2 c2 d2
a2 b2 c2 d2
a1 b1 c2 d2
a2 b1 c2 d2
a1 b1 c1 d2
a2 b1 c1 d2
a1 b2 c1 d2
a2 b2 c1 d2
a1 b1 c1 d2
a2 b1 c1 d2
a1 b1 c2 d1
a2 b1 c2 d1
a1 b2 c2 d1
a2 b2 c2 d1
As the numbers change I will get many sets of data.
How can I filter out the unique sets, and count how many times each unique set appears?
Mm, genetics, it's tasty...
So, to solve this problem in Python you should:
Grab the data from the xlsx file (it looks like Excel). For this, just use pandas: pd.read_excel()
(OPTIONAL STEP) Prepare your data. I see one cell with no value; it might cause problems.
Create the index labels you want (a1, a2, etc.). You can generate them with a for loop that builds a list, then use pd.set_index()
Main idea: you create 2 for loops: one for, let's say, the static component (outer loop), another for the dynamic component (inner loop).
In your example:
a1 b1 c1 d1
a1 b1 c1 d1
a2 b1 c1 d1
Static is "b1 c1 d1", and dynamic is "a1" --> "a2".
After one iteration the static component must change: "b1 c1 d1" --> "b2 c2 d2".
Each iteration must finish by appending the set to the list you created (list.append(set)).
After the operations above, you need to filter the result. The steps are:
Create an empty dict where each key represents a unique element and each value is the number of times it appears
Make a for loop like:
for s in list_of_sets:
    if s not in counts:
        counts[s] = 1
    else:
        counts[s] += 1
Or you can use collections.Counter or np.unique().
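A minimal sketch of the counting step with collections.Counter (the sample tuples are made up):
from collections import Counter

list_of_sets = [('a1', 'b1', 'c1', 'd1'), ('a2', 'b1', 'c1', 'd1'),
                ('a1', 'b1', 'c1', 'd1')]   # hypothetical sample
counts = Counter(list_of_sets)  # maps each unique tuple to its frequency
print(counts.most_common())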
I hope it will help you with your task.
Thank you Roman_N, here is my code:
import pandas as pd
import itertools
from collections import Counter

df = pd.read_csv("BN.csv")
result = []
for index, row in df.iterrows():
    # one pair per group; take every combination of one value from each pair
    s = [[row['a1'], row['a2']], [row['b1'], row['b2']],
         [row['c1'], row['c2']], [row['d1'], row['d2']]]
    for item in itertools.product(*s):
        result.append(item)
# print(result)
counts = Counter(result)
for element in counts:
    print(element, counts[element])
print('Number of unique sets:', len(counts))
I am trying to read an excel file with multiple sheets as follows:
sumtech = pd.read_excel('excelfile.xlsx', sheet_name=None)
One of the sheets has the following format:
        c3  c4
d1  d2  d3  d4
b1  b2  b3  b4
Since this sheet starts with empty cells (no header), only columns 3 and 4 are read. How do I tell pandas to read the whole table?
Use header=None so pandas doesn't look for column names, then define the columns by hand (or take them from the first data row); skiprows lets you skip the partial header row -
import pandas as pd
df = pd.read_excel('excelfile.xlsx', header=None, skiprows=1)
df.columns = ['c1', 'c2', 'c3', 'c4']
print(df)
Output
c1 c2 c3 c4
0 d1 d2 d3 d4
1 b1 b2 b3 b4
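Alternatively, the names parameter of read_excel sets the column names in the same call:
df = pd.read_excel('excelfile.xlsx', header=None, skiprows=1, names=['c1', 'c2', 'c3', 'c4'])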
I want to reshape a dataframe with repeating rows. The data comes from a csv file where blocks of data are repeated.
As an example:
Name 1st 2nd
0 Value1 a1 b1
1 Value2 a2 b2
2 Value3 a3 b3
3 Value1 a4 b4
4 Value2 a5 b5
5 Value3 a6 b6
Shall be reshaped into:
Name 1st 2nd 3rd 4th
Value1 a1 b1 a4 b4
Value2 a2 b2 a5 b5
Value3 a3 b3 a6 b6
Do you have any suggestions how to do this?
I've already looked at this thread, but I cannot see how to translate that approach to my problem, where there is more than one column to the right of the column the groupby operates on.
You can use set_index and stack to combine your two columns into one, cumcount to get the new column labels, and pivot to do the reshaping:
# Stack the 1st and 2nd columns, and use cumcount to get the new column labels.
df = df.set_index('Name').stack().reset_index(level=1, drop=True).to_frame()
df['new_col'] = df.groupby(level='Name').cumcount()
# Perform a pivot to get the desired shape.
df = df.pivot(columns='new_col', values=0)
# Formatting: move 'Name' back to a column and drop the column-axis name.
df = df.reset_index().rename_axis(None, axis=1)
The resulting output:
Name 0 1 2 3
0 Value1 a1 b1 a4 b4
1 Value2 a2 b2 a5 b5
2 Value3 a3 b3 a6 b6
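If you want the 1st/2nd/3rd/4th labels from the question, one more line renames the numbered columns (assuming four value columns, as in the example):
df.columns = ['Name', '1st', '2nd', '3rd', '4th']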
Create a dataframe holding the second occurrence of each Name group, then merge it with the original:
df1 = df.groupby('Name')[['1st', '2nd']].apply(lambda x: x.iloc[1]).reset_index()
df1.columns = ['Name', '3rd', '4th']
df = df.drop_duplicates(subset=['Name']).merge(df1, on = 'Name')
You get
Name 1st 2nd 3rd 4th
0 Value1 a1 b1 a4 b4
1 Value2 a2 b2 a5 b5
2 Value3 a3 b3 a6 b6
I have a pandas dataframe with several columns that label the data in a final column, for example,
df = pd.DataFrame( {'1_label' : ['a1','b1','c1','d1'],
'2_label' : ['a2','b2','c2','d2'],
'3_label' : ['a3','b3','c3','d3'],
'data' : [1,2,3,4]})
df = 1_label 2_label 3_label data
0 a1 a2 a3 1
1 b1 b2 b3 2
2 c1 c2 c3 3
3 d1 d2 d3 4
and a list of tuples,
list_t = [('a1','a2','a3'), ('d1','d2','d3')]
I want to filter this dataframe and return a new dataframe containing only the rows that correspond to the tuples in my list.
result = 1_label 2_label 3_label data
0 a1 a2 a3 1
1 d1 d2 d3 4
My naive (and C++ inspired) solution was to use append (like vector::push_back)
result = pd.DataFrame()
for l1, l2, l3 in list_t:
    match = df[(df['1_label'] == l1) &
               (df['2_label'] == l2) &
               (df['3_label'] == l3)]
    if not match.empty:
        result = result.append(match)
While my solution works, I suspect it is horrendously slow for large dataframes and large lists of tuples, as I think pandas creates a new dataframe upon each call to append. Could anyone suggest a faster/cleaner way to do this? Thanks!
Assuming no duplicates, you could create an index out of the columns you want to "filter" on:
In [10]: df
Out[10]:
1_label 2_label 3_label data
0 a1 a2 a3 1
1 b1 b2 b3 2
2 c1 c2 c3 3
3 d1 d2 d3 4
In [11]: df.set_index(['1_label', '2_label', '3_label'])\
.loc[[('a1','a2','a3'), ('d1','d2','d3')]]\
.reset_index()
Out[11]:
1_label 2_label 3_label data
0 a1 a2 a3 1
1 d1 d2 d3 4
If I understood correctly, merge should do the job:
pd.DataFrame(list_t, columns=['1_label', '2_label', '3_label']).merge(df)
Out[73]:
1_label 2_label 3_label data
0 a1 a2 a3 1
1 d1 d2 d3 4