Set specific rows values to column in pandas dataframe using list comprehension - python

I would like to change values in a column for specific rows. The rows are defined in a vector containing the row indexes. I thought this could be done with list comprehension (similar to apply function in R). I tried this:
[(dataframe.loc[(dataframe['id']== x), 'column_to_be_changed'] = 1) (x) for x in indexes]
The error message is "SyntaxError: invalid syntax" with pointer to " = 1 "
This part works:
x = n (e.g. 5)
dataframe.loc[(dataframe['id']== x), 'column_to_be_changed'] = 1)
Since a list comprehension gives a list back and not pandas dataframe, I am missing something, I guess. Help would be much appreciated. Thanks.

I think you are just looking for mask or where. See the below example:
df=pd.DataFrame({'id': [1,2,3,4], 'some_column': ['a','b','c','d']})
print(df)
# id some_column
# 0 1 a
# 1 2 b
# 2 3 c
# 3 4 d
li = [1,2] #indexes 1 and 2, so b and c
mask = df.index.isin(li)
df['some_column'].mask(mask, 'z', inplace=True) # 'z' is the value that will be set if the index is in 'li'
print(df)
# id some_column
# 0 1 a
# 1 2 z
# 2 3 z
# 3 4 d

Related

How to get a transition string per row object based on two different columns in python (without using loops)?

I have the following data structure:
The columns s and d are indicating the transition of object in column x. What I want to do is get a transition string per object present in the column x. For e.g. with a new column as follows:
Is there a lean way to do it using pandas, without using too many loops?
This was the code I tried:
obj = df['x'].tolist()
rows = []
for o in obj:
locs = df[df['x'] == o]['s'].tolist()
str_locs = '->'.join(str(l) for l in locs)
print(str_locs)
d = dict()
d['x'] = o
d['new'] = str_locs
rows.append(d)
tmp = pd.DataFrame(rows)
This give the output temp as:
x new
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
b 1->2
b 1->2
Example df:
df = pd.DataFrame({"x":["a","a","a","a","b","b"], "s":[1,2,4,8,5,11],"d":[2,4,8,9,11,12]})
print(df)
x s d
0 a 1 2
1 a 2 4
2 a 4 8
3 a 8 9
4 b 5 11
5 b 11 12
Following code will generate a transition string of all objects present in the column x.
groupby with respect to column x and get list of lists of s and d for every object available in x
Merge the list of lists sequentially
Remove consecutive duplicates from the merged list using itertools.groupby
Join the items of merged list with -> to make it a single string.
Finally map the series to column x of input df
from itertools import groupby
grp = df.groupby('x')[['s', 'd']].apply(lambda x: x.values.tolist())
grp = grp.apply(lambda x: [str(item) for tup in x for item in tup])
sr = grp.apply(lambda x: "->".join([i[0] for i in groupby(x)]))
df["new"] = df["x"].map(sr)
print(df)
x s d new
0 a 1 2 1->2->4->8->9
1 a 2 4 1->2->4->8->9
2 a 4 8 1->2->4->8->9
3 a 8 9 1->2->4->8->9
4 b 5 11 5->11->12
5 b 11 12 5->11->12

How to transform the result of a Pandas `GROUPBY` function to the original dataframe

Suppose I have a Pandas DataFrame with 6 columns and a custom function that takes counts of the elements in 2 or 3 columns and produces a boolean output. When a groupby object is created from the original dataframe and the custom function is applied df.groupby('col1').apply(myfunc), the result is a series whose length is equal to the number of categories of col1. How do I expand this output to match the length of the original dataframe? I tried transform, but was not able to use the custom function myfunc with it.
EDIT:
Here is an example code:
A = pd.DataFrame({'X':['a','b','c','a','c'], 'Y':['at','bt','ct','at','ct'], 'Z':['q','q','r','r','s']})
print (A)
def myfunc(df):
return ((df['Z'].nunique()>=2) and (df['Y'].nunique()<2))
A.groupby('X').apply(myfunc)
I would like to expand this output as a new column Result such that where there is a in column X, the Result will be True.
You can map the groupby back to the original dataframe
A['Result'] = A['X'].map(A.groupby('X').apply(myfunc))
Result would look like:
X Y Z Result
0 a at q True
1 b bt q False
2 c ct r True
3 a at r True
4 c ct s True
My solution may not be the best one, which uses a loop, but it's pretty good I think.
The core idea is you can traverse all the sub-dataframe (gdf) by for i, gdf in gp. Then add the column result (in my example it is c) for each sub-dataframe. Finally concat all the sub-dataframe into one.
Here is an example:
import pandas as pd
df = pd.DataFrame({'a':[1,2,1,2],'b':['a','b','c','d']})
gp = df.groupby('a') # group
s = gp.apply(sum)['a'] # apply a func
adf = []
# then create a new dataframe
for i, gdf in gp:
tdf = gdf.copy()
tdf.loc[:,'c'] = s.loc[i]
adf.append(tdf)
pd.concat(adf)
from:
a b
0 1 a
1 2 b
2 1 c
3 2 d
to:
a b c
0 1 a 2
2 1 c 2
1 2 b 4
3 2 d 4

Duplicating Pandas Dataframe rows based on string split, without iteration

I have a dataframe with a multiindex, where one of thecolumns represents multiple values, separated by a "|", like this:
value
left right
x a|b 2
y b|c|d -1
I want to duplicate the rows based on the "right" column, to get something like this:
values
left right
x a 2
x b 2
y b -1
y c -1
y d -1
The solution I have to this feels wrong and runs slow, because it's based on iteration:
df2 = df.iloc[:0]
for index, row in df.iterrows():
stgs = index[1].split("|")
for s in stgs:
row.name = (index[0], s)
df2 = df2.append(row)
Is there a more vectored way to do this?
Pandas Series have a dedicated method split to perform this operation
split works only on Series so isolate the Column you want
SO = df['right']
Now 3 steps at once: spilt return A Series of array. apply(pd.Series, 1) convert array in columns. stack stacks you columns into a unique column
S1 = SO.str.split(',').apply(pd.Series, 1).stack()
The only issue is that you have now a multi-index. So just drop the level you don`t need
S1.index.droplevel(-1)
Full example
SO = pd.Series(data=["a,b", "b,c,d"])
S1 = SO.str.split(',').apply(pd.Series, 1).stack()
S1
Out[4]:
0 0 a
1 b
1 0 b
1 c
2 d
S1.index = S1.index.droplevel(-1)
S1
Out[5]:
0 a
0 b
1 b
1 c
1 d
Building upon the answer #xNoK, I am adding here the additional step needed to include the result back in the original DataFrame.
We have this data:
arrays = [['x', 'y'], ['a|b', 'b|c|d']]
midx = pd.MultiIndex.from_arrays(arrays, names=['left', 'right'])
df = pd.DataFrame(index=midx, data=[2, -1], columns=['value'])
df
Out[17]:
value
left right
x a|b 2
y b|c|d -1
First, let's generate the values for right index as #xNoK suggested. First take the Index level we want to work on by index.levels[1] and convert it it to series so that we can perform the str.split() function, and finally stack() it to get the result we want.
new_multi_idx_val = df.index.levels[1].to_series().str.split('|').apply(pd.Series).stack()
new_multi_idx_val
Out[18]:
right
a|b 0 a
1 b
b|c|d 0 b
1 c
2 d
dtype: object
Now we want to put this value in the original DataFrame df. To do that, let's change its shape so that result we generated in the previous step could be copied.
In order to do that, we can repeat the rows (including the indexes) by a number of | present in right level of multi-index. df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)) gives the number of times a row (including index) should be repeated. We apply this to the function index.repeat() and fetch values at those indexes to create a new DataFrame df_repeted.
df_repeted = df.loc[df.index.repeat(df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)))]
df_repeted
Out[19]:
value
left right
x a|b 2
a|b 2
y b|c|d -1
b|c|d -1
b|c|d -1
Now df_repeted DataFrame is in a shape where we could change the index to get the answer we want.
Replace the index of df_repeted with desired values as following:
df_repeted.index = [df_repeted.index.droplevel(1), new_multi_idx_val]
df_repeted.index.rename(names=['left', 'right'], inplace=True)
df_repeted
Out[20]:
value
left right
x a 2
b 2
y b -1
c -1
d -1

How is index modified by dataframe.sort_values()

I have a dataframe that i want to sort on one of my columns (that is a date)
However I have a loop i am running on the index (while i<df.shape[0]), I need the loop to go on my dataframe once it is sorted by date.
Is the current index modified accordingly to the sorting or should I use df.reset_index() ?
Maybe I'm not understanding the question, but a simple check shows that sort_values does modify the index:
df = pd.DataFrame({'x':['a','c','b'], 'y':[1,3,2]})
df = df.sort_values(by = 'x')
Yields:
x y
0 a 1
2 b 2
1 c 3
And a subsequent:
df = df.reset_index(drop = True)
Yields:
x y
0 a 1
1 b 2
2 c 3

add columns different length pandas

I have a problem with adding columns in pandas.
I have DataFrame, dimensional is nxk. And in process I wiil need add columns with dimensional mx1, where m = [1,n], but I don't know m.
When I try do it:
df['Name column'] = data
# type(data) = list
result:
AssertionError: Length of values does not match length of index
Can I add columns with different length?
If you use accepted answer, you'll lose your column names, as shown in the accepted answer example, and described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To save column names, use pandas.concat, but don't ignore_index (default value of ignore_index is false; so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
'Age':[10, 12, 13],
'Gender':['M','F','F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
# Age Gender Name
#0 10 M Nate A
#1 12 F Jessie A
#2 13 F Daniel H
#3 NaN NaN John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add the different size of list values to DataFrame.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the Length of all list
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Resize all according to the determined max length (not in this example
if not max_len == la:
a.extend(['']*(max_len-la))
if not max_len == lb:
b.extend(['']*(max_len-lb))
if not max_len == lc:
c.extend(['']*(max_len-lc))
Now the all list is same length and create dataframe
pd.DataFrame({'A':a,'B':b,'C':c})
Final Output is
A B C
0 1 0 1
1 2 1
2 3 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
I had the same issue, two different dataframes and without a common column. I just needed to put them beside each other in a csv file.
Merge:
In this case, "merge" does not work; even adding a temporary column to both dfs and then dropping it. Because this method makes both dfs with the same length. Hence, it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index()
df2 = df2.reset_index()
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs besides each other (column-wise), each of which with its own length.
If somebody like to replace a specific column of a different size instead of adding it.
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, list: list, column: str):
dict_from_list = dict(enumerate(list)) # create enumertable object from list and create dict
dataFrame_asDict = dataframe.to_dict() # Get DataFrame as Dict
dataFrame_asDict[column] = dict_from_list # Assign specific column
return pd.DataFrame.from_dict(dataFrame_asDict, orient='index').T # Create new DataSheet from Dict and return it

Categories