Replicate a row based on a conditional - python

I've read through at least 10 answers to very similar questions, but none of them works and/or is quite what I need. I have a large-ish dataframe in which I need to find a particular row and append a copy of that entire row. So for example:
Before:
index price quantity flavor
0 1.45 6 vanilla
1 1.85 3 berry
2 2.25 2 double chocolate
After:
index price quantity flavor
0 1.45 6 vanilla
1 1.85 3 berry
2 2.25 2 double chocolate
3 1.85 3 berry
What would seem to work based on my knowledge of pandas and python is this:
df.loc[df.index.max() + 1,:] = df.loc[df['flavor'] == 'berry'].values
However I get this error:
ValueError: setting an array element with a sequence.
Bear in mind that I have no idea where in the dataframe "berry" might be (other than that it will be in the "flavor" column). (edit to add) Also, there may be more than one "berry", so it would need to find them all.
Thoughts?

So, this is probably what you want:
import pandas as pd
df = pd.DataFrame({"quantity":[6, 3, 2], "flavor":["vanilla", "berry", "double chocolate"], "price":[1.45, 1.85, 2.25]})
df = df.append(df.loc[df['flavor'] == 'berry']).reset_index(drop=True)
df
#output
flavor price quantity
0 vanilla 1.45 6
1 berry 1.85 3
2 double chocolate 2.25 2
3 berry 1.85 3
Just using append and resetting index should do it.
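Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions the same idea can be written with pd.concat (a sketch using the df built above):
df = pd.concat([df, df.loc[df['flavor'] == 'berry']], ignore_index=True)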

I came up with a slightly different answer than what @user2906838 suggested. Because it is possible that there is more than one 'berry' in the dataframe, I created a new dataframe and then concatenated them:
import pandas as pd
df = pd.DataFrame({'quantity':[6, 3, 2], 'flavor':['vanilla', 'berry', 'double chocolate'], 'price':[1.45, 1.85, 2.25]})
df_flavor = df.loc[df['flavor'] == 'berry']  # selects every matching row
df = pd.concat([df, df_flavor], sort=False, ignore_index=True)
This worked fine, but would love to hear if there are other solutions!

Related

Most efficient way to transform this data using Pandas?

I currently have several hundred .csv files in the format shown on the left below, and I need to transform them all into the format shown on the right. I tried to highlight the blocks of data to make it easier to see what I'm trying to do.
Is there an efficient way to do this using Pandas? I was trying to formulate something using df.iteritems() but couldn't think of a good way to do it.
Given:
Date-Time L-A
0 5/1/2022 0:00 1.4
1 5/1/2022 0:05 1.4
2 5/2/2022 0:10 1.4
Doing:
name = df.columns[1]
df['x'] = name
df = df.reindex(columns=['x', 'Date-Time', name])
print(df.values)
Output:
[['L-A' '5/1/2022 0:00' 1.4]
 ['L-A' '5/1/2022 0:05' 1.4]
 ['L-A' '5/2/2022 0:10' 1.4]]
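Since the question mentions several hundred .csv files, here is a hedged sketch of applying the same transform per file and stacking the results; the data/*.csv glob pattern and the unified 'value' column name are assumptions, not part of the original answer:
import glob
import pandas as pd

frames = []
for path in glob.glob('data/*.csv'):            # hypothetical input location
    df = pd.read_csv(path)
    name = df.columns[1]                        # the value column, e.g. 'L-A'
    df['x'] = name                              # same trick as above
    df = df.rename(columns={name: 'value'})     # assumed: unify names so the files stack
    frames.append(df[['x', 'Date-Time', 'value']])
combined = pd.concat(frames, ignore_index=True)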
My beginner's level way.
Slicing by column index location, later adding them together using concat:
print(df) # initial data frame
DateTime Units DateTime Units
0 a 1 a111 10
1 b 2 b222 20
2 c 3 c333 30
Slicing by column index location as the initial DF has duplicated headers:
df2 = df.iloc[: , [0, 1]].copy()
df3 = df.iloc[: , [2, 3]].copy()
# adding all back into new DF
df_result = pd.concat([df2, df3]).reset_index(drop=True)
print(df_result)
output:
DateTime Units
0 a 1
1 b 2
2 c 3
3 a111 10
4 b222 20
5 c333 30
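If a file repeats the DateTime/Units pair more than twice, the same idea generalizes by stepping across the columns two at a time (a sketch, assuming every block is exactly two columns wide):
import pandas as pd

blocks = []
for i in range(0, df.shape[1], 2):
    b = df.iloc[:, i:i + 2].copy()
    b.columns = ['DateTime', 'Units']  # normalize the duplicated headers
    blocks.append(b)
df_result = pd.concat(blocks, ignore_index=True)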

Create a new column in a dataframe and add 1 to the previous row of that column

I am looking to derive a new column from a current column in my dataframe, adding 1 to the previous row's value to keep a kind of running total:
df['Touch_No'] = np.where((df.Time_btween_steps.isnull()) | (df.Time_btween_steps > 30), 1, df.First_touch.shift().add(1))
I basically want to check whether the column value is null: if it is, treat it as a first activity and reset the counter; if not, add 1 to the previous activity. This gives me a running total of the number of outreaches we are doing on specific people:
Expected outcome:
Time Between Steps   Touch_No
null                 1
0                    2
5.4                  3
6.7                  4
2                    5
null                 1
1                    2
Answered using a combination of cumsum(), groupby(), and cumcount():
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[None, 0, 5.4, 6.7, 2, None, 1], columns=['Time_btween_steps'])
# flag the rows where the counter should reset (null or more than 30)
df['reset'] = np.where(df['Time_btween_steps'].isnull() | (df['Time_btween_steps'] > 30), 1, 0)
# each reset starts a new group; count within the group, starting at 1
df['Touch_No'] = df.groupby(df['reset'].cumsum()).cumcount() + 1
df.head(10)
Edited according to your clarification:
df = pd.DataFrame({'Time_btween_steps': [None, 0, 5.4, 6.7, 2, None, 1],
                   'Touch_No': [50, 1, 2, 3, 4, 35, 1]})
mask = pd.isna(df['Time_btween_steps']) | (df['Time_btween_steps'] > 30)
df.loc[~mask, 'Touch_No'] += 1  # increment where the condition is not met
df.loc[mask, 'Touch_No'] = 1    # reset to 1 where it is met
Returns:
Time_btween_steps Touch_No
0 NaN 1
1 0.0 2
2 5.4 3
3 6.7 4
4 2.0 5
5 NaN 1
6 1.0 2
In my opinion a solution like this is much more readable. We increment by 1 where the condition is not met, and we set the ones where the condition is true to 1. You can combine these into a single line if you wish.
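With numpy imported as np, the single-line combination mentioned above might look like this (same df and mask as in the snippet):
df['Touch_No'] = np.where(mask, 1, df['Touch_No'] + 1)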
Old answer for posterity.
Here is a simple solution using pandas apply functionality which takes a function.
import pandas as pd
df = pd.DataFrame(data=[1, 2, 3, 4, None, 5, 0], columns=['test'])
df.test.apply(lambda x: 0 if pd.isna(x) else x+1)
Which returns:
0 2.0
1 3.0
2 4.0
3 5.0
4 0.0
5 6.0
6 1.0
Here I wrote the function in place, but if you have more complicated logic, such as resetting when the number is something else, you can write a custom function and pass that in instead of the lambda. This is not the only way to do it, but if your dataframe isn't huge (hundreds of thousands of rows), it should be performant. If you don't want a copy but want to overwrite the column, assign the result back by prepending df['test'] = to the last line.
If you want the output to be ints, you can also append .astype(int), but be careful about converting None/NaN to int.
Using np.where and index values with ffill for partitioning, plus a simple rank:
import numpy as np
import pandas as pd

sodf = pd.DataFrame({'time_bw_steps': [None, 0, 5.4, 6.7, 2, None, 1]})
# mark the start of each partition with its own index, then forward-fill
sodf['touch_partition'] = np.where(sodf.time_bw_steps.isna(), sodf.index, np.nan)
sodf['touch_partition'] = sodf['touch_partition'].ffill()
# rank within each partition in order of appearance
sodf['touch_no'] = sodf.groupby('touch_partition')['touch_partition'].rank(method='first', ascending=False)
sodf.drop(columns=['touch_partition'], inplace=True)
sodf
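For the sample frame above, touch_no comes out as 1 through 5 for the first partition and 1, 2 for the second, matching the expected outcome.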

How to create a DataFrame from one Series, when it is not as simple as transposing the object?

I've seen many similar questions here, but none of them applies to the case I need to solve. I have a products Series in which the names of the "future" columns end with the string [edit] and are mixed in with the values that are going to be joined under them. Something like this:
Index Values
0 Soda [edit]
1 Coke
2 Sprite
3 Ice Cream [edit]
4 Nestle
5 Snacks [edit]
6 Lays
7 Act II
8 Nachos
I need to turn this into a df, to get something like:
Soda Ice Cream Snacks
0 Coke Nestle Lays
1 Sprite NaN Act II
2 NaN NaN Nachos
I made a Series called cols_index, which stores the indexes of the column names as they appear in the first Series:
Index Values
0 Soda [edit]
3 Ice Cream [edit]
5 Snacks [edit]
However, from here I don't know how to pass the values to the columns. As I'm new to pandas, I thought of iterating with a for loop, generating ranges that would refer to the elements' indexes ([1,2], [4], [6:8]), but that wouldn't be a pandorable way to do things.
How can I do this? Thanks in advance.
=========================================================
EDIT: I solved it, here's how I did it.
After reviewing the problem with a colleague, we concluded that there was no pandorable way to do it, so I treated the data as a list and used for and if statements:
import pandas as pd

products = pd.read_csv("products_file.txt", header=None)[0]  # one-column file read as a Series
product_list = products.values.tolist()
cols = products[products.str.contains(r'\[edit\]', case=False)].values.tolist()  # elements to become columns

rows = []
category = product_list[0]
for item in product_list:
    if item in cols:
        category = item[:-6].strip()  # drops ' [edit]'
    else:
        rows.append((category, item))
df = pd.DataFrame(rows, columns=['Category', 'Product'])
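For what it's worth, a loop-free sketch over the same Series is possible (assuming products is the Series read above): flag the header rows, forward-fill the category, then drop the headers.
is_header = products.str.contains(r'\[edit\]', case=False)
category = (products.where(is_header)
                    .ffill()
                    .str.replace(r'\s*\[edit\]', '', regex=True))
df = (pd.DataFrame({'Category': category[~is_header],
                    'Product': products[~is_header]})
        .reset_index(drop=True))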
We use isin to find the column names, then build a pivot key with cumsum and cumcount, and finish with crosstab:
# df1 holds the full Series as a frame; df2 holds only the '[edit]' rows
s = df1.Values.isin(df2.Values)
df = pd.crosstab(index=s.cumsum(),
                 columns=s.groupby(s.cumsum()).cumcount(),
                 values=df1.Values,
                 aggfunc='first').set_index(0).T
0       Soda  Ice Cream  Snacks
col_0
1       Coke     Nestle    Lays
2     Sprite        NaN  Act II
3        NaN        NaN  Nachos

Slice a column in a pandas dataframe and average the results

If I have a pandas dataframe such as:
timestamp label value new
etc. a 1 3.5
b 2 5
a 5 ...
b 6 ...
a 2 ...
b 4 ...
I want the new column to be the average of the last two a's and the last two b's, so for the first row it would be the average of 5 and 2, giving 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get an average of just the last two. I'm fairly new to Python and coding, so this might not be possible.
Edit: I should also mention this is not for a class or anything; it's something I'm doing on my own, and it will run on a very large dataset. I'm just using this as an example. Also, I would want each a and each b to have its own value for the last-two average, so the dimension of the new column will be the same as the others. For the third line, it would be the average of 2 and whatever the next a would be in the dataset.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd

data = {'label': ['a', 'b', 'a', 'b', 'a', 'b'], 'value': [1, 2, 5, 6, 2, 4]}
df = pd.DataFrame(data)
grouped = df.groupby('label')

results = {'label': [], 'tail_mean': []}
for item, grp in grouped:
    subset_mean = grp.tail(2)['value'].mean()  # mean of the last two rows in this group
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)

res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of your results only, if you need them, plus a column with them merged back into the main dataframe. Someone else posted a more succinct way to get the results dataframe; there's probably no reason to do it the longer way shown here unless you also need to perform more operations of this kind inside the same loop.
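Reading the question's edit as wanting a per-row value, the average of the next two entries within the same label, a rolling window over the reversed frame is one hedged sketch (NaN where fewer than two values follow):
df['new'] = (df[::-1].groupby('label')['value']
                     .transform(lambda s: s.rolling(2).mean().shift(1)))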

Why am I getting an empty row in my dataframe after using pandas apply?

I'm fairly new to Python and Pandas and trying to figure out how to do a simple split-apply-combine. The problem I am having is that I am getting a blank row at the top of all the dataframes I get back from Pandas' apply function, and I'm not sure why. Can anyone explain?
The following is a minimal example that demonstrates the problem, not my actual code:
import pandas as pd

sorbet = pd.DataFrame({
    'flavour': ['orange', 'orange', 'lemon', 'lemon'],
    'niceosity': [4, 5, 7, 8]})

def calc_vals(df, target):
    return pd.Series({'total': df[target].count(), 'mean': df[target].mean()})

sorbet_grouped = sorbet.groupby('flavour')
sorbet_vals = sorbet_grouped.apply(calc_vals, target='niceosity')
If I then do print(sorbet_vals) I get this output:
mean total
flavour <--- Why are there spaces here?
lemon 7.5 2
orange 4.5 2
[2 rows x 2 columns]
Compare this with print(sorbet):
flavour niceosity <--- Note how column names line up
0 orange 4
1 orange 5
2 lemon 7
3 lemon 8
[4 rows x 2 columns]
What is causing this discrepancy and how can I fix it?
The groupby/apply operation returns a new DataFrame with a named index. The name corresponds to the column name by which the original DataFrame was grouped.
The name shows up above the index. If you reset it to None, then that row disappears:
In [155]: sorbet_vals.index.name = None
In [156]: sorbet_vals
Out[156]:
mean total
lemon 7.5 2
orange 4.5 2
[2 rows x 2 columns]
Note that the name is useful -- I don't really recommend removing it. The name allows you to refer to that index by name rather than merely by number.
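If you prefer not to mutate the index in place, rename_axis(None) does the same thing and returns a new frame:
sorbet_vals = sorbet_vals.rename_axis(None)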
If you wish the index to be a column, use reset_index:
In [209]: sorbet_vals.reset_index(inplace=True); sorbet_vals
Out[209]:
flavour mean total
0 lemon 7.5 2
1 orange 4.5 2
[2 rows x 3 columns]
