Python - Dataframe - Splitting and stacking string column

I have a dataframe that is generated by the following code:
data={'ID':[1,2,3],'String': ['xKx;yKy;zzz','-','z01;x04']}
frame=pd.DataFrame(data)
I would like to transform the frame dataframe into a dataframe that looks like this:
data_trans={'ID':[1,1,1,2,3,3],'String': ['xKx','yKy','zzz','-','z01','x04']}
frame_trans=pd.DataFrame(data_trans)
So, in words, I would like to have the elements of "String" in data split at the ";" and then stacked underneath each other in a new dataframe, and the associated ID should be duplicated accordingly. Splitting, of course, in principle, is not hard, but I am having trouble with the stacking.
I would appreciate it if you could offer me some hints on how to approach this in Python. Many thanks!!

I'm not sure this is the best way to do this but here is a working approach:
data={'ID':[1,2,3],'String': ['xKx;yKy;zzz','-','z01;x04']}
frame=pd.DataFrame(data)
print(frame)
data_trans={'ID':[1,1,1,2,3,3],'String': ['xKx','yKy','zzz','-','z01','x04']}
frame_trans=pd.DataFrame(data_trans)
print(frame_trans)
frame2 = frame.set_index('ID')
# This next line does almost all the work. It can be very memory intensive.
frame3 = frame2['String'].str.split(';').apply(pd.Series).stack().reset_index()[['ID', 0]]
frame3.columns = ['ID', 'String']
print(frame3)
# Verbose version
# Setting the index makes it easy to have the index column be repeated for each value later
frame2 = frame.set_index('ID')
print("frame2")
print(frame2)
# Make one column for each of the values in the multi-value column
frame3a = frame2['String'].str.split(';').apply(pd.Series)
print("frame3a")
print(frame3a)
# Convert from a wide-data format to a long-data format
frame3b = frame3a.stack()
print("frame3b")
print(frame3b)
# Get only the columns we care about
frame3c = frame3b.reset_index()[['ID', 0]]
print("frame3c")
print(frame3c)
# The columns have the wrong names. Let's fix that
frame3d = frame3c.copy()
frame3d.columns = ['ID', 'String']
print("frame3d")
print(frame3d)
Output:
frame2
         String
ID
1   xKx;yKy;zzz
2             -
3       z01;x04
frame3a
      0    1    2
ID
1   xKx  yKy  zzz
2     -  NaN  NaN
3   z01  x04  NaN
frame3b
ID
1   0    xKx
    1    yKy
    2    zzz
2   0      -
3   0    z01
    1    x04
dtype: object
frame3c
   ID    0
0   1  xKx
1   1  yKy
2   1  zzz
3   2    -
4   3  z01
5   3  x04
frame3d
   ID String
0   1    xKx
1   1    yKy
2   1    zzz
3   2      -
4   3    z01
5   3    x04
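On pandas 0.25 or newer there is a more direct route via explode; a minimal sketch (not from the original answer): split each string into a list, then let explode emit one row per element, repeating the ID automatically.
import pandas as pd

data = {'ID': [1, 2, 3], 'String': ['xKx;yKy;zzz', '-', 'z01;x04']}
frame = pd.DataFrame(data)

# Split into lists, then emit one row per list element; the ID is duplicated for us
frame_trans = (frame.assign(String=frame['String'].str.split(';'))
                    .explode('String')
                    .reset_index(drop=True))
print(frame_trans)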

Related

Sort column names using wildcard using pandas

I have a big dataframe with more than 100 columns. I am sharing a miniature version of my real dataframe below
ID rev_Q1 rev_Q5 rev_Q4 rev_Q3 rev_Q2 tx_Q3 tx_Q5 tx_Q2 tx_Q1 tx_Q4
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
I would like to do the below
a) sort the column names based on quarters (ex: Q1, Q2, Q3, Q4, Q5 .. Q100 .. Q1000) for each column pattern
b) by column pattern, I mean the keyword before the underscore, which is rev and tx.
So, I tried the below but it doesn't work and it also shifts the ID column to the back
df = df.reindex(sorted(df.columns), axis=1)
I expect my output to be as below. In my real data, there are more than 100 columns with more than 30 patterns like rev, tx, etc. I want my ID column to stay in the first position, as shown below.
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
For the provided example, df.sort_index(axis=1) should work fine.
If you have Q values higher than 9, use natural sorting with natsort:
from natsort import natsort_key
out = df.sort_index(axis=1, key=natsort_key)
Or use manual sorting with np.lexsort:
import numpy as np

idx = df.columns.str.split('_Q', expand=True, n=1)
order = np.lexsort([idx.get_level_values(1).astype(float), idx.get_level_values(0)])
out = df.iloc[:, order]
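If you prefer to sort the labels outside pandas, natsorted from the same package can reorder the column list directly; a small sketch with a made-up Q10 column to show the numeric ordering:
import pandas as pd
from natsort import natsorted

cols = ['ID', 'rev_Q1', 'rev_Q10', 'rev_Q2', 'tx_Q2', 'tx_Q1']
df = pd.DataFrame([[1] * len(cols)], columns=cols)

# natsorted compares the digit runs numerically, so Q2 comes before Q10
df = df[natsorted(df.columns)]
print(df.columns.tolist())
# ['ID', 'rev_Q1', 'rev_Q2', 'rev_Q10', 'tx_Q1', 'tx_Q2']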
Something like:
new_order = list(df.columns)
new_order.remove("ID")
df = df[['ID'] + sorted(new_order)]
We manually put "ID" in front and then sort what remains. Note that list.remove works in place and returns None, so it cannot be fed into sorted directly; also, plain sorted is lexicographic, so it only works while the quarter numbers stay in single digits (rev_Q10 would sort before rev_Q2).
The idea is to create a dataframe from the column names, with two columns: one for the variable and another for the quarter number. Finally, sort this dataframe by values, then extract the index.
idx = (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
         .fillna(0).astype({'Q': int})
         .sort_values(by=['V', 'Q']).index)
df = df.iloc[:, idx]
Output:
>>> df
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
0 1 1 1 1 1 1 1 1 1 1 1
1 2 1 1 1 1 1 1 1 1 1 1
>>> (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
...    .fillna(0).astype({'Q': int})
...    .sort_values(by=['V', 'Q']))
      V  Q
0     0  0
1   rev  1
5   rev  2
4   rev  3
3   rev  4
2   rev  5
9    tx  1
8    tx  2
6    tx  3
10   tx  4
7    tx  5

Pandas: Select column by location and rows by value

I have a dataframe where one of the column names is a variable:
xx = pd.DataFrame([{'ID': 1, 'Name': 'Abe', 'HasCar': 1},
                   {'ID': 2, 'Name': 'Ben', 'HasCar': 0},
                   {'ID': 3, 'Name': 'Cat', 'HasCar': 1}])
ID Name HasCar
0 1 Abe 1
1 2 Ben 0
2 3 Cat 1
In this dummy example column 2 could be "HasCar", or "IsStaff", or some other unknowable value. I want to select all rows, where column 2 is True, whatever the column name is.
I've tried the following without success:
xx.iloc[:,[2]] == 1
HasCar
0 True
1 False
2 True
and then trying to use that as an index results in:
xx[xx.iloc[:,[2]] == 1]
ID Name HasCar
0 NaN NaN 1.0
1 NaN NaN NaN
2 NaN NaN 1.0
Which isn't helpful. I suppose I could rename column 2, but that feels a little wrong. The issue seems to be that xx.iloc[:,[2]] returns a dataframe while xx['HasCar'] returns a series, and I can't figure out how to force an (x, 1) shaped dataframe into a series without knowing the column name, as described here.
Any ideas?
It was almost correct, but you sliced in 2D; use Series slicing instead:
xx[xx.iloc[:, 2] == 1]
Output:
ID Name HasCar
0 1 Abe 1
2 3 Cat 1
The difference:
# 2D slicing, this gives a DataFrame (with a single column)
xx.iloc[:,[2]]
HasCar
0 1
1 0
2 1
# 1D slicing, as Series
xx.iloc[:,2]
0 1
1 0
2 1
Name: HasCar, dtype: int64
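On the side question of forcing an (x, 1) shaped dataframe into a series without knowing the column name, squeeze does exactly that; a sketch of the idea:
# A single-column DataFrame collapses to a Series with squeeze
s = xx.iloc[:, [2]].squeeze(axis=1)
xx[s == 1]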

Python Pandas - How to get group by counts by values from multiple columns with multiple values

My data includes a few variables holding data from multi-answer questions. These are stored as string (comma separated) and aren't ordered by value.
I need to run different counts across 2 or more of these variables at the same time, i.e. get the frequencies of each combination of their unique values.
I also have a second dataframe with the available codes for each variable
df_meta['a']['Categories'] = ['1', '2', '3','4']
df_meta['b']['Categories'] = ['1', '2']
If this is my data
df = pd.DataFrame(np.array([["1,3", "1"], ["3", "1,2"], ["1,3,2", "1"], ["3,1", "2,1"]]),
                  columns=['a', 'b'])
index a b
1 1,3 1
2 3 1,2
3 1,3,2 1
4 3,1 2,1
Ideally, this is what the output would look like
a b count
1 1 3
1 2 1
2 1 1
2 2 0
3 1 4
3 2 2
4 1 0
4 2 0
Although if it's not possible to get the zero counts, this would be just fine:
a b count
1 1 3
1 2 1
2 1 1
3 1 4
3 2 2
So far, I got the counts for each of these variables individually, by using split and value_counts
df["a"].str.split(',',expand=True).stack().value_counts()
3 4
1 3
2 1
df["b"].str.split(',',expand=True).stack().value_counts()
1 4
2 2
But I can't figure out how to group by them, because of the differences in their indexes.
df2 = pd.DataFrame()
df2["a"] = df["a"].str.split(',',expand=True).stack()
df2["b"] = df["b"].str.split(',',expand=True).stack()
df2.groupby(['a','b']).size()
a  b
1  1    3
3  1    1
   2    1
Is there a way to adjust the groupby to only count the instances of the first index, or another way to count the unique combinations more efficiently?
I can alternatively iterate through all the codes using the df_meta dataframe, but some of the actual variables have 300-400 codes and it's very slow when I try to cross 2-3 of them; if groupby or another function can be used, it should be much faster.
First, we create your starting dataframe.
df = pd.DataFrame(np.array([["1,3", "1"], ["3", "1,2"], ["1,3,2", "1"], ["3,1", "2,1"]]),
                  columns=['a', 'b'])
Then split columns to separate dataframes.
da = df["a"].str.split(',',expand=True)
db = df["b"].str.split(',',expand=True)
Loop through all rows of both dataframes, build a temporary dataframe for each combination, and collect them in a list.
ab = list()
for r in range(len(da)):
    for i in da.iloc[r, :]:
        for j in db.iloc[r, :]:
            if i is not None and j is not None:
                daf = pd.DataFrame({'a': [i], 'b': [j]})
                ab.append(daf)
Concatenate list of temporary dataframes into one new dataframe.
dfn = pd.concat(ab)
Groupby with 'a' and 'b' columns and size() gives you the answer.
print(dfn.groupby(['a', 'b']).size().reset_index(name='count'))
a b count
0 1 1 3
1 1 2 1
2 2 1 1
3 3 1 4
4 3 2 2
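A vectorized alternative (a sketch, not from the original answer): explode each column and join on the shared row index. Joining two objects that carry the same repeated index yields the per-row Cartesian product, which is exactly the set of combinations to count.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([["1,3", "1"], ["3", "1,2"], ["1,3,2", "1"], ["3,1", "2,1"]]),
                  columns=['a', 'b'])

# One row per value, keeping the original row index
da = df['a'].str.split(',').explode()
db = df['b'].str.split(',').explode()

# Join on the (non-unique) index: each row's a-values pair with each of its b-values
pairs = da.to_frame().join(db)
print(pairs.groupby(['a', 'b']).size().reset_index(name='count'))
To also get the zero counts from the question, the grouped result can be reindexed against the full code lists before reset_index, e.g. with pd.MultiIndex.from_product([df_meta['a']['Categories'], df_meta['b']['Categories']], names=['a', 'b']) and fill_value=0.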

Reducing a Pandas dataframe to another dataframe

I have two dataframes, with shapes (707, 140) and (34, 98).
I want to cut the bigger dataframe down to the small one based on shared index and column names.
So, after removing the extra rows and columns from the bigger dataframe, its final shape should be (34, 98), with the same index and columns as the small dataframe.
How can I do this in Python?
I think you can select with loc, using the index and columns of the small DataFrame:
dfbig.loc[dfsmall.index, dfsmall.columns]
Sample:
dfbig = pd.DataFrame({'a':[1,2,3,4,5], 'b':[4,7,8,9,4], 'c':[5,0,1,2,4]})
print (dfbig)
a b c
0 1 4 5
1 2 7 0
2 3 8 1
3 4 9 2
4 5 4 4
dfsmall = pd.DataFrame({'a':[4,8], 'c':[0,1]})
print (dfsmall)
a c
0 4 0
1 8 1
print (dfbig.loc[dfsmall.index, dfsmall.columns])
a c
0 1 5
1 2 0
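If the small frame's labels are not guaranteed to all exist in the big frame (on newer pandas, loc raises a KeyError for missing labels in a list), intersecting first is safer; a sketch:
# Keep only the row and column labels the two frames share
rows = dfbig.index.intersection(dfsmall.index)
cols = dfbig.columns.intersection(dfsmall.columns)
out = dfbig.loc[rows, cols]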

Transform pandas timeseries into timeseries with non-date index

I'm trying to generate a timeseries from a dataframe, but the solutions I've found here don't really address my specific problem. I have a dataframe which is a series of id's which iterate from 1 to n, then repeat, like this:
key ID Var_1
0 1 1
0 2 1
0 3 2
1 1 3
1 2 2
1 3 1
I want to reshape it into a timeseries-like frame in which the index is the ID:
ID Var_1_0 Var_2_0
1 1 3
2 1 2
3 2 1
I have tried the stack() method but it doesn't generate the result I want. Generating an index from ID seems to be the right approach, but ID is not a proper date, so I'm not sure how to proceed. Pointers much appreciated.
Try this:
import pandas as pd
df = pd.DataFrame([[0,1,1], [0,2,1], [0,3,2], [1,1,3], [1,2,2], [1,3,1]], columns=('key', 'ID', 'Var_1'))
Use the pivot function:
# keyword arguments are required for pivot on pandas 2.0+
df2 = df.pivot(index='ID', columns='key', values='Var_1')
You can rename the columns by:
df2.columns = ('Var_1_0', 'Var_2_0')
Result:
Out:
Var_1_0 Var_2_0
ID
1 1 3
2 1 2
3 2 1
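If the number of key values isn't fixed at two, the renaming can be derived from the pivoted columns instead of hard-coded; a sketch, assuming the Var_n_0 naming scheme from the question:
# key values 0, 1, ... become Var_1_0, Var_2_0, ...
df2.columns = [f'Var_{k + 1}_0' for k in df2.columns]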
