How to split DataFrame columns into multiple rows? - python

I am trying to convert multiple columns to multiple rows. Can someone please offer some advice?
I have DataFrame:
id . values
1,2,3,4 [('a','b'), ('as','bd'),'|',('ss','dd'), ('ws','ee'),'|',('rr','rt'), ('tt','yy'),'|',('yu','uu'), ('ii','oo')]
I need it to look like this:
ID Values
1 ('a','b'), ('as','bd')
2 ('ss','dd'), ('ws','ee')
3 ('rr','rt'), ('tt','yy')
4 ('yu','uu'), ('ii','oo')
I have tried groupby, split, izip. Maybe I am not doing it the right way?

I made a quick and dirty example of how you could parse this dataframe:
# example dataframe
df = [
    "1,2,3,4",
    [('a','b'), ('as','bd'), '|', ('ss','dd'), ('ws','ee'), '|', ('rr','rt'), ('tt','yy'), '|', ('yu','uu'), ('ii','oo')]
]
# split ids by comma
ids = df[0].split(",")
# init Id and Items as int and dict()
Id = 0
Items = dict()
# prepare array for data insert
for i in ids:
    Items[i] = []
# insert data
for i in df[1]:
    if isinstance(i, tuple):
        Items[ids[Id]].append(i)
    elif isinstance(i, str):
        Id += 1
# print data as written in stackoverflow question
print("id .\tvalues")
for item in Items:
    print("{}\t{}".format(item, Items[item]))

I came up with a quite concise solution based on multi-level grouping,
which in my opinion is fairly pandasonic.
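For reference, here is a hedged reconstruction of the question's single-row input DataFrame (column names taken from the question), so the snippets below can be run end to end:
import pandas as pd

df = pd.DataFrame({
    'id': ['1,2,3,4'],
    'values': [[('a','b'), ('as','bd'), '|', ('ss','dd'), ('ws','ee'), '|',
                ('rr','rt'), ('tt','yy'), '|', ('yu','uu'), ('ii','oo')]],
})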
Start by defining the following function, which "splits" a Series taken
from an individual values element into a sequence of list representations,
without the surrounding [ and ]. The split occurs at each '|' element:
def fn(grp1):
    grp2 = (grp1 == '|').cumsum()
    return grp1[grp1 != '|'].groupby(grp2).apply(lambda x: repr(list(x))[1:-1])
(will be used a bit later).
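To see what it does before wiring it into the pipeline, a quick check on a hand-built Series (my own illustration, not from the answer):
s = pd.Series([('a','b'), ('as','bd'), '|', ('ss','dd'), ('ws','ee')])
print(fn(s))
# 0     ('a', 'b'), ('as', 'bd')
# 1    ('ss', 'dd'), ('ws', 'ee')
# dtype: object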
The first step of processing is to convert id column into a Series:
sId = df.id.apply(lambda x: pd.Series(x.split(','))).stack().rename('ID')
For your data the result is:
0  0    1
   1    2
   2    3
   3    4
Name: ID, dtype: object
The first level of the MultiIndex is the index of the source row, and the
second level holds consecutive numbers within that row.
Now it's time to perform similar conversion of values column:
sVal = pd.DataFrame(df['values'].values.tolist(), index=df.index)\
    .stack().groupby(level=0).apply(fn).rename('Values')
The result is:
0  0     ('a', 'b'), ('as', 'bd')
   1    ('ss', 'dd'), ('ws', 'ee')
   2    ('rr', 'rt'), ('tt', 'yy')
   3    ('yu', 'uu'), ('ii', 'oo')
Name: Values, dtype: object
Note that the MultiIndex above has the same structure as in the case of sId.
And the last step is to concat both these partial results:
result = pd.concat([sId, sVal], axis=1).reset_index(drop=True)
The result is:
ID Values
0 1 ('a', 'b'), ('as', 'bd')
1 2 ('ss', 'dd'), ('ws', 'ee')
2 3 ('rr', 'rt'), ('tt', 'yy')
3 4 ('yu', 'uu'), ('ii', 'oo')

Related

Python dataframe, move record to another column if it contains specific value

I have the following data:
For example in row 2, I want to move all the "3:xxx" to column 3, and all the "4:xxx" to column 4. How can I do that?
Btw, I have tried this but it doesn't work:
df[3] = np.where((df[2].str.contains('3:')))
Dataset loading:
url = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale'
df = pd.read_csv(url,header=None,delim_whitespace=True)
I think the easiest thing to do would be to cleanse the data set before reading it into a dataframe. Looking at the data source, it appears there are some rows with missing fields, e.g.:
# (Missing the 3's field)
'1 1:-0.611111 2:0.166667508 4:-0.916667'
So I would clean up the file before reading it. For this line, you could stick an extra space between 2:0.166667508 and 4:-0.916667 to denote a null 3rd column:
'1 1:-0.611111 2:0.166667508 4:-0.916667 '.split(' ')
# ['1', '1:-0.611111', '2:0.166667508', '4:-0.916667', '']
'1 1:-0.611111 2:0.166667508  4:-0.916667 '.split(' ')
# ['1', '1:-0.611111', '2:0.166667508', '', '4:-0.916667', '']
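To automate that kind of cleanup, here is a hedged sketch (the helper name pad_libsvm_line, the file names, and the fixed feature range 1-4 are my assumptions, not from the answer). It rewrites each libsvm-style line as comma-separated values with explicit empty fields, which pandas can then read directly:
import pandas as pd

def pad_libsvm_line(line, n_features=4):
    parts = line.split()
    label, feats = parts[0], parts[1:]
    present = {f.split(':', 1)[0]: f for f in feats}
    # an empty string marks a missing feature column
    padded = [present.get(str(i), '') for i in range(1, n_features + 1)]
    return ','.join([label] + padded)

with open('iris.scale') as src, open('iris_clean.csv', 'w') as dst:
    for line in src:
        if line.strip():
            dst.write(pad_libsvm_line(line) + '\n')

df = pd.read_csv('iris_clean.csv', header=None)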
I agree with Greg's suggestion of cleansing the data set before reading it into a dataframe, but if you still want to shift mismatched values into their correct columns after loading, you can try the approach below.
input.csv
1,1:-0.55,2:0.25,3:-0.86,4:-91
1,1:-0.57,2:0.26,3:-0.87,4:-0.92
1,1:-0.57,3:-0.89,4:-0.93,NaN
1,1:-0.58,2:0.25,3:-0.88,4:-0.99
Code that shifts values at a particular index:
import pandas as pd

df = pd.read_csv('files/60009536-input.csv')
print(df)
for col_num in df.columns:
    if col_num > '0':  # Assuming there is no problem at index column 0
        for row_val in df[col_num]:
            if row_val != 'nan':
                if col_num != row_val[:1]:  # Comparing column number with sliced value
                    row = df[df[col_num] == row_val].index.values  # on true get row index as we already know column #
                    print("Found at column {0} and row {1}".format(col_num, row))
                    r_value = df.loc[row, str(row_val[:1])].values  # capturing value on target location
                    print("target location value", r_value)
                    # print("target location value", r_value[0][:1])
                    df.at[row, str(r_value[0][:1])] = r_value  # shifting target location's value to its correct loc
                    df.at[row, str(row_val[:1])] = row_val  # Shift to appropriate column
                    df.at[row, col_num] = 'NaN'  # finally update that cell to NaN
print(df)
output:
   0        1        2        3        4
0  1  1:-0.55   2:0.25  3:-0.86    4:-91
1  1  1:-0.57   2:0.26  3:-0.87  4:-0.92
2  1  1:-0.57  3:-0.89  4:-0.93      NaN
3  1  1:-0.58   2:0.25  3:-0.88  4:-0.99
Found at column 2 and row [2]
target location value ['4:-0.93']
   0        1        2        3        4
0  1  1:-0.55   2:0.25  3:-0.86    4:-91
1  1  1:-0.57   2:0.26  3:-0.87  4:-0.92
2  1  1:-0.57      NaN  3:-0.89  4:-0.93
3  1  1:-0.58   2:0.25  3:-0.88  4:-0.99

Using dictionary to add some columns to a dataframe with assign function

I was using Python and pandas to do some statistical analysis on data, and at some point I needed to add some new columns with the assign function:
df_res = (
    df
    .assign(col1 = lambda x: np.where(x['event'].str.contains('regex1'),1,0))
    .assign(col2 = lambda x: np.where(x['event'].str.contains('regex2'),1,0))
    .assign(mycol = lambda x: np.where(x['event'].str.contains('regex3'),1,0))
    .assign(newcol = lambda x: np.where(x['event'].str.contains('regex4'),1,0))
)
I wanted to know if there is any way to put the column names and my regexes in a dictionary and use a for loop or another lambda expression to assign these columns automatically:
Dic = {'col1':'regex1','col2':'regex2','mycol':'regex3','newcol':'regex4'}
df_res = (
    df
    .assign(...using Dic here...)
)
I will need to add more columns later, and I think this would make that easier.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html
Assigning multiple columns within the same assign is possible. For Python 3.6 and above, later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order. For Python 3.5 and below, the order of keyword arguments is not specified, you cannot refer to newly created or modified columns. All items are computed first, and then assigned in alphabetical order.
Changed in version 0.23.0: Keyword argument order is maintained for Python 3.6 and later.
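A small illustration of that order guarantee (my own example, assuming Python 3.6+ and pandas 0.23+): a later keyword in the same assign call can refer to a column created by an earlier one.
import pandas as pd

tmp = pd.DataFrame({'a': [1, 2]})
print(tmp.assign(b=lambda x: x['a'] * 10,
                 c=lambda x: x['b'] + 1))  # 'c' sees the freshly created 'b'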
If you map all your regexes so that each dictionary value holds a lambda instead of just the regex, you can simply unpack the dict into assign:
lambda_dict = {
    col:
    lambda x, regex=regex: (
        x['event']
        .str.contains(regex)
        .astype(int)
    )
    for col, regex in Dic.items()
}
res = df.assign(**lambda_dict)
EDIT
Here's an example:
import pandas as pd
import random

random.seed(0)

events = ['apple_one', 'chicken_one', 'chicken_two', 'apple_two']
data = [random.choice(events) for __ in range(10)]
df = pd.DataFrame(data, columns=['event'])

regex_dict = {
    'apples': 'apple',
    'chickens': 'chicken',
    'ones': 'one',
    'twos': 'two',
}

lambda_dict = {
    col:
    lambda x, regex=regex: (
        x['event']
        .str.contains(regex)
        .astype(int)
    )
    for col, regex in regex_dict.items()
}

res = df.assign(**lambda_dict)
print(res)
# Output
         event  apples  chickens  ones  twos
0    apple_two       1         0     0     1
1    apple_two       1         0     0     1
2    apple_one       1         0     1     0
3  chicken_two       0         1     0     1
4    apple_two       1         0     0     1
5    apple_two       1         0     0     1
6  chicken_two       0         1     0     1
7    apple_two       1         0     0     1
8  chicken_two       0         1     0     1
9  chicken_one       0         1     1     0
The problem with the prior version of this code was that the regex inside each lambda was only looked up when the lambda was called, so every column ended up using the regex from the last loop iteration. Passing it as a default argument (regex=regex) captures the value at definition time and fixes this.
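A minimal standalone illustration of that late-binding pitfall (my own toy example, unrelated to the question's data):
funcs_bad = {name: (lambda: name) for name in ['a', 'b', 'c']}
funcs_good = {name: (lambda name=name: name) for name in ['a', 'b', 'c']}

print(funcs_bad['a']())   # prints 'c' - every lambda sees the final loop value
print(funcs_good['a']())  # prints 'a' - the default argument freezes the value per iteration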
This can also do what you want (note that str.contains returns booleans, so add .astype(int) if you need the 1/0 values np.where gives):
pd.concat([df,pd.DataFrame({a:list(df["event"].str.contains(b)) for a,b in Dic.items()})],axis=1)
A plain for loop building the same dictionary would also work.
If I understand your question correctly, you're trying to rename the columns, in which case I think you could just use Pandas' rename function. This would look like
df_res = df_res.rename(mapper=Dic)
-Ben

Python pandas split column with NaN values

Hello my dear coders,
I'm new to coding and I've stumbled upon a problem. I want to split a column of a csv file that I have imported via pandas in Python. The column is named CATEGORY and contains 1, 2 or 3 values separated by a comma (e.g. 2343, 3432, 4959). Now I want to split these values into separate columns named CATEGORY, SUBCATEGORY and SUBSUBCATEGORY.
I have tried this line of code:
products_combined[['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY']] = products_combined.pop('CATEGORY').str.split(expand=True)
But I get this error: ValueError: Columns must be same length as key
Would love to hear your feedback <3
You need:
pd.DataFrame(df.CATEGORY.str.split(',').tolist(), columns=['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY'])
Output:
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 2343 3432 4959
I think this could be accomplished by creating three new columns and assigning each to a lambda function applied to the 'CATEGORY' column. Like so:
products_combined['SUBCATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[1] if len(original) > 1 else None)
products_combined['SUBSUBCATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[2] if len(original) > 2 else None)
products_combined['CATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[0])
The apply() method called on a series returns a new series that contains the result of running the passed function (in this case, the lambda function) on each row of the original series.
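Note that this assumes the CATEGORY column already holds lists; if it still holds comma-separated strings, a hedged first step (my addition) would be to split it before running the three apply calls above:
products_combined['CATEGORY'] = products_combined['CATEGORY'].str.split(',')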
IIUC, use split and then Series:
(
    df[0].apply(lambda x: pd.Series(x.split(",")))
    .rename(columns={0:"CATEGORY", 1:"SUBCATEGORY", 2:"SUBSUBCATEGORY"})
)
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 1 NaN NaN
2 44 55 NaN
Data:
d = [["2343,3432,4959"],["1"],["44,55"]]
df = pd.DataFrame(d)

How to sort an alphanumeric field in pandas?

I have a dataframe and the first column contains id. How do I sort the first column when it contains alphanumeric data, such as:
id = ["6LDFTLL9", "N9RFERBG", "6RHSDD46", "6UVSCF4H", "7SKDEZWE", "5566FT6N","6VPZ4T5P", "EHYXE34N", "6P4EF7BB", "TT56GTN2", "6YYPH399" ]
Expected result is
id = ["5566FT6N", "6LDFTLL9", "6P4EF7BB", "6RHSDD46", "6UVSCF4H", "6VPZ4T5P", "6YYPH399", "7SKDEZWE", "EHYXE34N", "N9RFERBG", "TT56GTN2" ]
You can utilize the list's .sort() method, which sorts the list in place:
>>> id.sort()
>>> id
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
If you don't want to change the original id list, you can utilize the built-in sorted() function, which returns a new sorted list:
>>> sorted(id)
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
>>> id
['6LDFTLL9', 'N9RFERBG', '6RHSDD46', '6UVSCF4H', '7SKDEZWE', '5566FT6N', '6VPZ4T5P', 'EHYXE34N', '6P4EF7BB', 'TT56GTN2', '6YYPH399']
Notice, with this one, that id is unchanged.
For a DataFrame, you want to use sort_values().
df.sort_values(0, inplace=True)
Here 0 is the column label (the default label when no header is given); you can also pass the column name (e.g. 'id').
           0
5   5566FT6N
0   6LDFTLL9
8   6P4EF7BB
2   6RHSDD46
3   6UVSCF4H
6   6VPZ4T5P
10  6YYPH399
4   7SKDEZWE
7   EHYXE34N
1   N9RFERBG
9   TT56GTN2
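Putting it together for a DataFrame built from the question's id list (a hedged sketch; naming the column 'id' is my choice):
import pandas as pd

df = pd.DataFrame({'id': id})
print(df.sort_values('id').reset_index(drop=True))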

Pandas joining based on date

I'm trying to join two dataframes with dates that don't perfectly match up. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe with a date just before that of the left dataframe. Probably easiest to show with an example.
df1:
group date teacher
a 1/10/00 1
a 2/27/00 1
b 1/7/00 1
b 4/5/00 1
c 2/9/00 2
c 9/12/00 2
df2:
teacher date hair length
1 1/1/00 4
1 1/5/00 8
1 1/30/00 20
1 3/20/00 100
2 1/1/00 0
2 8/10/00 50
Gives us:
group date teacher hair length
a 1/10/00 1 8
a 2/27/00 1 20
b 1/7/00 1 8
b 4/5/00 1 100
c 2/9/00 2 0
c 9/12/00 2 50
Edit 1:
Hacked together a way to do this. Basically I iterate through every row in df1 and pick out the most recent corresponding entry in df2. It is insanely slow; surely there must be a better way.
One way to do this is to create a new column in the left data frame, which will (for a given row's date) determine the value that is closest and earlier:
df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())
then a regular join or merge between 'join_date' on the left and 'date' on the right will work. You may need to tweak the function to handle Null values or other corner cases.
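For example, the follow-up could look like this (a hedged sketch, not the answer's exact code: I also restrict the lookup and the merge to the same teacher, which the sample data seems to require, and assume the date columns are already datetimes):
# compute the closest earlier date within each teacher, then merge on it
df1['join_date'] = df1.apply(
    lambda r: df2.loc[(df2.teacher == r.teacher) & (df2.date <= r.date), 'date'].max(),
    axis=1)
result = df1.merge(df2, left_on=['teacher', 'join_date'],
                   right_on=['teacher', 'date'], how='left',
                   suffixes=('', '_df2'))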
This map-based approach is not very efficient (you are searching the right-hand dates over and over). A more efficient approach is to sort both data frames by the dates, iterate through the left-hand data frame, and consume entries from the right-hand data frame just until the date is larger:
# Assuming df1 and df2 are sorted by the dates
df1['hair length'] = 0  # initialize
r_generator = df2.iterrows()
_, cur_r_row = next(r_generator)
for i, l_row in df1.iterrows():
    cur_hair_length = 0  # Assume 0 works when df1 has a date earlier than df2
    while cur_r_row['date'] <= l_row['date']:
        cur_hair_length = cur_r_row['hair length']
        try:
            _, cur_r_row = next(r_generator)
        except StopIteration:
            break
    df1.loc[i, 'hair length'] = cur_hair_length
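One caveat for both approaches (my addition): the date comparisons are only chronological if the date columns are real datetimes, so parse them first, e.g.:
df1['date'] = pd.to_datetime(df1['date'], format='%m/%d/%y')
df2['date'] = pd.to_datetime(df2['date'], format='%m/%d/%y')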
Seems like the quickest way to do this is using sqlite via pysqldf:
def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys):
    try:
        tablea_group, tablea_date = tablea_keys
        tableb_group, tableb_date = tableb_keys
    except ValueError as e:
        raise ValueError('Need to pass in both a group and date key for both tables') from e
    # Note: can't actually use group here as a field name due to sqlite
    statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
                   FROM (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                                MAX(tableb.{date_b}) AS tdate
                         FROM tablea
                         JOIN tableb
                           ON tablea.{group_a}=tableb.{group_b}
                          AND tablea.{date_a}>=tableb.{date_b}
                         GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
                        ) AS a
                   JOIN tableb b
                     ON a.{group_a}=b.{group_b}
                    AND a.tdate=b.{date_b};
                """.format(group_a=tablea_group, date_a=tablea_date,
                           group_b=tableb_group, date_b=tableb_date,
                           temp_date='join_date', base_id=base_id)
    # Note: you lose types here for tableb so you may want to save them
    pre_join_tableb = sqldf(statement, locals())
    return pd.merge(tablea, pre_join_tableb, how='inner',
                    left_on=['group'] + tablea_keys,
                    right_on=['group', tableb_group, 'join_date'])
