Generate pivot data in Python

Suppose I have 100 files and loop through all of them. Each file contains records of several attributes (the total number of attributes is not known before reading all the files).
Assume the simple case that, after reading all the files, we obtain 20 different attributes and the following information:
File_001: a1, a3, a5, a2
File_002: a1, a3
File_003: a4
File_004: a4, a2, a6
File_005: a7, a8, a9
...
File_100: a19, a20
[Update] Or in another representation, where each line is a single match between one File and one attribute:
File_001: a1
File_001: a3
File_001: a5
File_001: a2
File_002: a1
File_002: a3
File_003: a4
File_004: a4
File_004: a2
File_004: a6
...
File_100: a19
File_100: a20
How can I generate the "reverse" statistics table, i.e.:
a1: File_001, File_002, File_006, File_083
a2: File_001, File_004
...
a20: File_099, File_100
How can I do this in Python (2.7.x), with or without Pandas? (I think Pandas might help.)

UPDATE 2: How can I generate the "reverse" statistics table?
In [9]: df
Out[9]:
file attr
0 File_001 a1
1 File_001 a3
2 File_001 a5
3 File_001 a2
4 File_002 a1
5 File_002 a3
6 File_003 a4
7 File_004 a4
8 File_004 a2
9 File_004 a6
10 File_100 a19
11 File_100 a20
In [10]: df.groupby('attr')['file'].apply(list)
Out[10]:
attr
a1 [File_001, File_002]
a19 [File_100]
a2 [File_001, File_004]
a20 [File_100]
a3 [File_001, File_002]
a4 [File_003, File_004]
a5 [File_001]
a6 [File_004]
Name: file, dtype: object
UPDATE:
How can I get the Out[202] result as a DataFrame,
new = (df.set_index('file')
         .apply(lambda x: pd.Series(x['attr']), axis=1)
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='attr')
         .groupby('attr')['file']
         .apply(list)
      )
so I can export it to html or csv?
new.reset_index().to_csv('/path/to/file.csv', index=False)
or
html_text = new.reset_index().to_html(index=False)
Original answer:
Here is a pandas solution:
Original DF:
In [201]: df
Out[201]:
file attr
0 File_001 [a1, a3, a5, a2]
1 File_002 [a1, a3]
2 File_003 [a4]
3 File_004 [a4, a2, a6]
4 File_005 [a7, a8, a9]
5 File_100 [a19, a20]
Solution:
In [202]: %paste
(df.set_index('file')
   .apply(lambda x: pd.Series(x['attr']), axis=1)
   .stack()
   .reset_index(level=1, drop=True)
   .reset_index(name='attr')
   .groupby('attr')['file']
   .apply(list)
)
## -- End pasted text --
Output:
Out[202]:
attr
a1 [File_001, File_002]
a19 [File_100]
a2 [File_001, File_004]
a20 [File_100]
a3 [File_001, File_002]
a4 [File_003, File_004]
a5 [File_001]
a6 [File_004]
a7 [File_005]
a8 [File_005]
a9 [File_005]
Name: file, dtype: object

While reading the files, for each attribute you read, check a map (dict) to see whether its keys already include that attribute. If not, add the attribute as a key and append the name of the file you read it from to that key's values; if the attribute is already a key of the map, just append the filename as a value.
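A minimal sketch of that approach, assuming each file lists one attribute per line (the question does not show the file format) and using placeholder file names:
attr_to_files = {}                       # attribute -> list of files containing it
for name in ['File_001', 'File_002']:    # placeholder list of the 100 files
    with open(name) as fh:
        for line in fh:
            attr = line.strip()          # assumed format: one attribute per line
            if not attr:
                continue
            if attr not in attr_to_files:        # key missing: add it first
                attr_to_files[attr] = []
            if name not in attr_to_files[attr]:  # avoid duplicate file names
                attr_to_files[attr].append(name)

for attr in sorted(attr_to_files):
    print('%s: %s' % (attr, ', '.join(attr_to_files[attr])))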

Related

Column with alphanumeric values needs to be sorted based on the numeric part, split by ".", and formatted as required as a new column in the same dataframe [duplicate]

I have a pandas DataFrame with indices I want to sort naturally. Natsort doesn't seem to work. Sorting the indices prior to building the DataFrame doesn't seem to help because the manipulations I do to the DataFrame seem to mess up the sorting in the process. Any thoughts on how I can resort the indices naturally?
from natsort import natsorted
import pandas as pd
# An unsorted list of strings
a = ['0hr', '128hr', '72hr', '48hr', '96hr']
# Sorted incorrectly
b = sorted(a)
# Naturally Sorted
c = natsorted(a)
# Use a as the index for a DataFrame
df = pd.DataFrame(index=a)
# Sorted Incorrectly
df2 = df.sort()
# Natsort doesn't seem to work
df3 = natsorted(df)
print(a)
print(b)
print(c)
print(df.index)
print(df2.index)
print(df3.index)
Using sort_values for pandas >= 1.1.0
With the new key argument in DataFrame.sort_values, since pandas 1.1.0, we can directly sort a column without setting it as an index using natsort.natsort_keygen:
df = pd.DataFrame({
    "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
    "value": [10, 20, 30, 40, 50]
})
time value
0 0hr 10
1 128hr 20
2 72hr 30
3 48hr 40
4 96hr 50
from natsort import natsort_keygen
df.sort_values(
    by="time",
    key=natsort_keygen()
)
time value
0 0hr 10
3 48hr 40
2 72hr 30
4 96hr 50
1 128hr 20
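The same key argument also works for sort_index since pandas 1.1.0, so a naturally sorted index no longer needs reindexing; a minimal sketch:
import pandas as pd
from natsort import natsort_keygen

df = pd.DataFrame({'value': [10, 20, 30, 40, 50]},
                  index=['0hr', '128hr', '72hr', '48hr', '96hr'])
df.sort_index(key=natsort_keygen())  # index order: 0hr, 48hr, 72hr, 96hr, 128hr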
Now that pandas has support for key in both sort_values and sort_index, you should refer to this other answer and send all upvotes there, as it is now the correct answer.
I will leave my answer here for people stuck on old pandas versions, or as a historical curiosity.
The accepted answer answers the question being asked. I'd like to also add how to use natsort on columns in a DataFrame, since that will be the next question asked.
In [1]: from pandas import DataFrame
In [2]: from natsort import natsorted, index_natsorted, order_by_index
In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])
In [4]: df
Out[4]:
a b
0hr a5 b1
128hr a1 b1
72hr a10 b2
48hr a2 b2
96hr a12 b1
As the accepted answer shows, sorting by the index is fairly straightforward:
In [5]: df.reindex(index=natsorted(df.index))
Out[5]:
a b
0hr a5 b1
48hr a2 b2
72hr a10 b2
96hr a12 b1
128hr a1 b1
If you want to sort on a column in the same manner, you need to sort the index by the order that the desired column was reordered. natsort provides the convenience functions index_natsorted and order_by_index to do just that.
In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a)))
Out[6]:
a b
128hr a1 b1
48hr a2 b2
0hr a5 b1
72hr a10 b2
96hr a12 b1
In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b)))
Out[7]:
a b
0hr a5 b1
128hr a1 b1
96hr a12 b1
72hr a10 b2
48hr a2 b2
If you want to reorder by an arbitrary number of columns (or a column and the index), you can use zip (or itertools.izip on Python 2) to specify sorting on multiple columns. The first column given will be the primary sorting column, then secondary, then tertiary, etc.:
In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a))))
Out[8]:
a b
128hr a1 b1
0hr a5 b1
96hr a12 b1
48hr a2 b2
72hr a10 b2
In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index))))
Out[9]:
a b
0hr a5 b1
96hr a12 b1
128hr a1 b1
48hr a2 b2
72hr a10 b2
Here is an alternate method using Categorical objects that I have been told by the pandas devs is the "proper" way to do this. This requires (as far as I can see) pandas >= 0.16.0. Currently, it only works on columns, but apparently in pandas >= 0.17.0 they will add CategoricalIndex which will allow this method to be used on an index.
In [1]: from pandas import DataFrame
In [2]: from natsort import natsorted
In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])
In [4]: df.a = df.a.astype('category')
In [5]: df.a.cat.reorder_categories(natsorted(df.a), inplace=True, ordered=True)
In [6]: df.b = df.b.astype('category')
In [8]: df.b.cat.reorder_categories(natsorted(set(df.b)), inplace=True, ordered=True)
In [9]: df.sort('a')
Out[9]:
a b
128hr a1 b1
48hr a2 b2
0hr a5 b1
72hr a10 b2
96hr a12 b1
In [10]: df.sort('b')
Out[10]:
a b
0hr a5 b1
128hr a1 b1
96hr a12 b1
72hr a10 b2
48hr a2 b2
In [11]: df.sort(['b', 'a'])
Out[11]:
a b
128hr a1 b1
0hr a5 b1
96hr a12 b1
48hr a2 b2
72hr a10 b2
The Categorical object lets you define a sorting order for the DataFrame to use. The elements given when calling reorder_categories must be unique, hence the call to set for column "b".
I leave it to the user to decide if this is better than the reindex method or not, since it requires you to sort the column data independently before sorting within the DataFrame (although I imagine that second sort is rather efficient).
Full disclosure, I am the natsort author.
If you want to sort the df, just sort the index or the data and assign directly to the index of the df, rather than trying to pass the df as an argument, as that yields an empty list:
In [7]:
df.index = natsorted(a)
df.index
Out[7]:
Index(['0hr', '48hr', '72hr', '96hr', '128hr'], dtype='object')
Note that df.index = natsorted(df.index) also works.
If you pass the df itself as an argument, it yields an empty list, in this case because the df is empty (has no columns); otherwise it would return the sorted column names, which is not what you want:
In [10]:
natsorted(df)
Out[10]:
[]
EDIT
If you want to sort the index so that the data is reordered along with the index then use reindex:
In [13]:
df=pd.DataFrame(index=a, data=np.arange(5))
df
Out[13]:
0
0hr 0
128hr 1
72hr 2
48hr 3
96hr 4
In [14]:
df = df*2
df
Out[14]:
0
0hr 0
128hr 2
72hr 4
48hr 6
96hr 8
In [15]:
df.reindex(index=natsorted(df.index))
Out[15]:
0
0hr 0
48hr 6
72hr 4
96hr 8
128hr 2
Note that you have to assign the result of reindex either to a new df or back to itself; it does not accept an inplace param.

How to get data sets using Python or C++

I have data that looks like this:
So if I give a name to each cell in a row: a1 a2 b1 b2 c1 c2 d1 d2. The rule is: A B C D, and you can swap positions within each big row. I need to take sets of 4 numbers, so I will have:
a1 b1 c1 d1
a2 b1 c1 d1
a1 b2 c2 d2
a2 b2 c2 d2
a1 b1 c2 d2
a2 b1 c2 d2
a1 b1 c1 d2
a2 b1 c1 d2
a1 b2 c1 d2
a2 b2 c1 d2
a1 b1 c1 d2
a2 b1 c1 d2
a1 b1 c2 d1
a2 b1 c2 d1
a1 b2 c2 d1
a2 b2 c2 d1
So as I change the numbers I will have many sets of data.
How can I filter down to the unique data sets, and count how many times each unique set appears?
Mm, genetics, it's tasty...
So, to solve this problem in Python you should:
Grab the data from xlsx (it looks like Excel). For this, just use pandas: pd.read_excel().
(OPTIONAL STEP) Prepare your data. I see one cell with no value; it might cause problems.
Create indexes according to your wishes (a1, a2, etc.). You can generate them with a for loop that returns a list, then use pd.set_index().
Main idea: you create two for loops: one for, let's say, the static component (outer loop), and another for the dynamic component (inner loop).
In your example:
a1 b1 c1 d1
a2 b1 c1 d1
Static is "b1 c1 d1", and dynamic is "a1" --> "a2".
After one iteration the static component must change: "b1 c1 d1" --> "b2 c2 d2".
Each iteration must finish by appending the combination to the list you created (list.append(...)).
After the operations above, you need to filter the result. The steps are:
Create an empty dict where each key is the representation of a unique element and each value is the number of times it appears.
Make the for loop like:
counts = {}
for combo in combos:        # combos is the list built above
    if combo not in counts:
        counts[combo] = 1
    else:
        counts[combo] += 1
Or you can use collections.Counter or np.unique().
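For example, a short Counter sketch; note that the combinations must be stored as tuples (hashable), not sets, to serve as dict/Counter keys:
from collections import Counter

combos = [('a1', 'b1', 'c1', 'd1'),      # placeholder combinations
          ('a2', 'b1', 'c1', 'd1'),
          ('a1', 'b1', 'c1', 'd1')]
counts = Counter(combos)                 # tuple -> number of appearances
print(counts[('a1', 'b1', 'c1', 'd1')])  # 2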
I hope it will help you with your task.
Thank you Roman_N, here is my code:
import pandas as pd
import itertools
from collections import Counter

df = pd.read_csv("BN.csv")
result = []
for index, row in df.iterrows():
    s = [[row['a1'], row['a2']], [row['b1'], row['b2']],
         [row['c1'], row['c2']], [row['d1'], row['d2']]]
    for item in itertools.product(*s):  # 2**4 = 16 combinations per row
        result.append(item)

counts = Counter(result)                # tuple -> number of appearances
for element in counts:
    print(element, counts[element])
print('length is', len(counts))

Creating arrays from long table structure

I have 24MM rows of data that look like this:
event_date event_id incoming_event_id
2018-12-21 A1 A2
2019-07-20 A2 A3
2018-03-21 B1 B2
2016-08-09 C1 C2
2017-04-02 C2 C3
2018-11-10 C3 C4
What I want to do is create an array for each grouping of events. In this case they would look like:
event_groups
[A1, A2, A3]
[B1, B2]
[C1, C2, C3, C4]
The lengths of these arrays could grow for a while, I suspect up to 100 elements. What is the most efficient way of going about this?
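One possible sketch, under my own assumption (not stated in the question beyond the sample) that each event_id links to at most one incoming_event_id, so the rows form simple chains; it builds a dict of links and walks each chain from its head in O(n):
event_links = [('A1', 'A2'), ('A2', 'A3'), ('B1', 'B2'),
               ('C1', 'C2'), ('C2', 'C3'), ('C3', 'C4')]

nxt = dict(event_links)         # event_id -> incoming_event_id
has_parent = set(nxt.values())  # ids that appear as someone's successor

event_groups = []
for start in nxt:
    if start in has_parent:     # not the head of a chain
        continue
    chain = [start]
    while chain[-1] in nxt:
        chain.append(nxt[chain[-1]])
    event_groups.append(chain)

print(event_groups)  # [['A1', 'A2', 'A3'], ['B1', 'B2'], ['C1', 'C2', 'C3', 'C4']]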

Reshape/pivot a datafile of key-value pairs with recurring key values

I found many similarly titled questions, but could not find the exact one I am looking for.
I have a datafile like this:
title1:A1
title2:A2
title3:A3
title4:A4
title5:A5
title1:B1
title2:B2
title3:B3
title4:B4
title5:B5
title1:C1
title2:C2
title3:C3
title4:C4
title5:C5
title1:D1
title2:D2
title3:D3
title4:D4
title5:D5
Using pandas I would like to get a table like this:
title1 title2 title3 title4 title5
0 A1 A2 A3 A4 A5
1 B1 B2 B3 B4 B5
2 C1 C2 C3 C4 C5
3 D1 D2 D3 D4 D5
My attempt:
import pandas as pd
import numpy as np
df = pd.read_csv('colon_sep.txt',header=None,sep=':')
df.columns = ['title','id']
# for loop method
df2 = pd.DataFrame()
for t in df.title.unique():
    df2[t] = df[df.title == t]['id'].values
df2
# How can I get this with more advanced methods?
I was able to get the required table using for loop.
Is there a better way using groupby or any other advanced method?
You can simplify your code a bit, to include a pivot call at the end for efficiency:
df = pd.read_csv('colon_sep.txt', sep=':', header=None)
df.insert(2, 2, df.groupby(0).cumcount())    # row number within each title block
df = df.pivot(index=2, columns=0, values=1)  # titles become columns
print(df)
0 title1 title2 title3 title4 title5
2
0 A1 A2 A3 A4 A5
1 B1 B2 B3 B4 B5
2 C1 C2 C3 C4 C5
3 D1 D2 D3 D4 D5
After you do
df = pd.read_csv('colon_sep.txt',header=None,sep=':')
You can do
df = pd.DataFrame({name: list(group[1]) for name, group in df.groupby(0)})
Or, if you have the data in a string called text, you can do
df = pd.DataFrame([[line.split(':')[1] for line in lines.split('\n')] for lines in text.split('\n\n')])
You can get the column names with
df.columns = [line.split(':')[0] for line in text.split('\n\n')[0].split('\n')]

Pandas reshaping repeating rows

I want to reshape a dataframe with repeating rows. The data comes from a csv file where blocks of data are repeated.
As an example:
Name 1st 2nd
0 Value1 a1 b1
1 Value2 a2 b2
2 Value3 a3 b3
3 Value1 a4 b4
4 Value2 a5 b5
5 Value3 a6 b6
Shall be reshaped into:
Name 1st 2nd 3rd 4th
Value1 a1 b1 a4 b4
Value2 a2 b2 a5 b5
Value3 a3 b3 a6 b6
Do you have any suggestions how to do this?
I've already looked at this thread; however, I cannot see how to translate that approach to my problem, where there is more than one column to the right of the column the groupby operates on.
You can use set_index and stack to combine your two columns into one, cumcount to get the new column labels, and pivot to do the reshaping:
# Stack the 1st and 2nd columns, and use cumcount to get the new column labels.
df = df.set_index('Name').stack().reset_index(level=1, drop=True).to_frame()
df['new_col'] = df.groupby(level='Name').cumcount()
# Perform a pivot to get the desired shape.
df = df.pivot(columns='new_col', values=0)
# Formatting.
df = df.reset_index().rename_axis(None, axis=1)
The resulting output:
Name 0 1 2 3
0 Value1 a1 b1 a4 b4
1 Value2 a2 b2 a5 b5
2 Value3 a3 b3 a6 b6
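If you want the 1st/2nd/3rd/4th labels from the question back (my reading of the desired output), rename the columns afterwards:
df.columns = ['Name', '1st', '2nd', '3rd', '4th']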
Create a dataframe with the repeated values of df after grouping by Name, and merge that df with the original.
df1 = df.groupby('Name')[['1st', '2nd']].apply(lambda x: x.iloc[1]).reset_index()
df1.columns = ['Name', '3rd', '4th']
df = df.drop_duplicates(subset=['Name']).merge(df1, on = 'Name')
You get
Name 1st 2nd 3rd 4th
0 Value1 a1 b1 a4 b4
1 Value2 a2 b2 a5 b5
2 Value3 a3 b3 a6 b6
