So I have a dataframe where I want to count all the days a student was present. The dataframe headers are the days of the month, and I want to count the frequency of the character 'P' row-wise over all the columns and store the result in a new column. What I have done until now is define a function which should accept each row and count the frequency of 'P':
def count_P(row):
    frequency = 0
    for item in row:
        if item == 'P':
            frequency += 1
    return frequency
And then I am trying to apply this function which is what I am confused about:
df['Attendance'] = df.apply(lambda x: count_P(x) for x in , axis = 1)
In the above line I need to pass x as a row of the dataframe every time, so do I write for x in range(df.iloc[0],df.iloc[df.shape[0]])? But that gives me a SyntaxError. And do I need axis here? Or does it need to be done in some other way?
Edit:
The error message I am getting:
df['Attendance'] = df.apply(lambda x: count_P(x) for x in range(df.iloc[0],df.iloc[df.shape[0]]),axis=1)
^
SyntaxError: Generator expression must be parenthesized
Assuming your dataframe looks like this:
df = pd.DataFrame({'2021-03-01': ['P','P'], '2021-03-02': ['P','X']})
You can do:
df["p_count"] = (df == 'P').sum(axis=1)
yields:
2021-03-01 2021-03-02 p_count
0 P P 2
1 P X 1
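For completeness, the asker's row-counting function also works if it is passed straight to apply with axis=1, no generator expression needed. A minimal sketch with the same toy frame:

```python
import pandas as pd

def count_P(row):
    # count how many entries in the row equal 'P'
    frequency = 0
    for item in row:
        if item == 'P':
            frequency += 1
    return frequency

df = pd.DataFrame({'2021-03-01': ['P', 'P'], '2021-03-02': ['P', 'X']})
# axis=1 hands each row to count_P as a Series
df['Attendance'] = df.apply(count_P, axis=1)
```

Both approaches give the same counts; the boolean-sum version is simply shorter and vectorized.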
I have a dataframe as given:
Now as seen from the dataframe, the player x lost 6 times and the player y lost 9 times. I want to make a dataframe which lists the players and how many times they have lost. Hence the final dataframe should look like this:
One option I found was to use dataframe.apply, wherein I could return the number of rows which satisfied the condition. The code for the same is:
import pandas as pd
df=pd.read_csv('C:/Users/sadik/Desktop/Data.csv')
df = df.apply(lambda x: True if x['Result'] == "Lost" else False, axis=1)
num_rows = len(df[df == True].index)
print('Number of Rows in dataframe in which Condition is met: ', num_rows)
This gives me the output
runfile('C:/Users/sadik/untitled0.py', wdir='C:/Users/sadik')
Number of Rows in dataframe in which Condition is met: 15
My question is: how do I use the same logic to output the count by the name of the players, as shown in the expected output?
You can group by the Name and then aggregate the total number of "Lost" results:
out = df.groupby("Name").Result.agg(lambda s: s.eq("Lost").sum()).to_frame("Count")
where we finally turn it into a dataframe with the counts named "Count", to get:
>>> out
Count
Name
x 6
y 9
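A runnable sketch of that line, with a toy frame built to match the counts in the question (the Name/Result column names are assumptions):

```python
import pandas as pd

# toy data: player x lost 6 times, player y lost 9 times
df = pd.DataFrame({"Name": ["x"] * 6 + ["y"] * 9,
                   "Result": ["Lost"] * 15})

# per group, count rows whose Result equals "Lost"
out = df.groupby("Name").Result.agg(lambda s: s.eq("Lost").sum()).to_frame("Count")
```

Unlike a plain count(), this only tallies "Lost" rows, so it stays correct if the data also contains "Won" results.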
You can just use groupby:
import pandas as pd
df = pd.DataFrame({"name": ['x'] * 6 + ['y'] * 9, 'result': ['LOST'] * 15})
df.groupby('name').count()
>>>
result
name
x 6
y 9
In case you want to use apply you can do that too:
df.groupby('name')[['result']].apply(lambda x: x.count())
Assuming a pandas dataframe like the one in the picture, I would like to fill the NA values with the value of the other variables similar to it. To be more clear, my variables are
mean_1, mean_2 .... , std_1, std_2, ... min_1, min_2 ...
So I would like to fill the NA values with the values of the other columns, but not all the columns, only those which represent the same metric. In the picture I highlighted 2 NA values. The first one I would like to fill with the mean obtained from the 'MEAN' variables at row 2, while the second NA I would like to fill with the mean obtained from the 'MIN' variables at row 9. Is there a way to do it?
you can find the unique prefixes, iterate through each and do fillna for the subsets separately
uniq_prefixes = set([x.split('_')[0] for x in df.columns])
for prfx in uniq_prefixes:
    mask = [col for col in df if col.startswith(prfx)]
    # Transpose is needed because row-wise fillna is not implemented yet
    df.loc[:, mask] = df[mask].T.fillna(df[mask].mean(axis=1)).T
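A small self-contained check of that loop (a sketch; the column names mean_1 through std_2 and the NaN positions are invented for the demo):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'mean_1': [1.0, np.nan], 'mean_2': [3.0, 4.0],
                   'std_1': [0.5, 0.6], 'std_2': [np.nan, 0.8]})

uniq_prefixes = {c.split('_')[0] for c in df.columns}
for prfx in uniq_prefixes:
    mask = [col for col in df if col.startswith(prfx)]
    # transpose so fillna can use the row-wise mean of this prefix group
    df.loc[:, mask] = df[mask].T.fillna(df[mask].mean(axis=1)).T
```

Each NaN ends up as the mean of the other columns sharing its prefix in the same row: mean_1 at row 1 becomes 4.0, std_2 at row 0 becomes 0.5.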
Yes, it is possible using a loop. Below is a naive approach, but even fancier ones would not offer much optimisation (at least I don't see any).
for i, row in df.iterrows():
    sum_means = 0
    n_means = 0
    sum_stds = 0
    n_stds = 0
    fill_mean_idxs = []
    fill_std_idxs = []
    for idx, item in row.items():
        if idx.startswith('mean') and pd.isna(item):
            fill_mean_idxs.append(idx)
        elif idx.startswith('mean'):
            sum_means += float(item)
            n_means += 1
        elif idx.startswith('std') and pd.isna(item):
            fill_std_idxs.append(idx)
        elif idx.startswith('std'):
            sum_stds += float(item)
            n_stds += 1
    ave_mean = sum_means / n_means if n_means else None
    ave_std = sum_stds / n_stds if n_stds else None
    for idx in fill_mean_idxs:
        df.loc[i, idx] = ave_mean
    for idx in fill_std_idxs:
        df.loc[i, idx] = ave_std
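An alternative, more vectorized sketch of the same per-prefix filling, grouping the transposed frame by column prefix (column names and values invented for the demo):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'mean_1': [np.nan, 2.0], 'mean_2': [4.0, 6.0],
                   'std_1': [1.0, np.nan], 'std_2': [3.0, 5.0]})

# group the transposed frame's rows (original columns) by their prefix,
# then fill each row's NaNs with the mean of its prefix group
prefixes = df.columns.str.split('_').str[0]
filled = df.T.groupby(prefixes).transform(lambda g: g.fillna(g.mean())).T
```

This avoids the explicit double loop and handles any number of prefixes without extra branches.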
The title may not be very clear, but with an example I hope it would make some sense.
I would like to create an output column (called "outputTics"), and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
As you can see, there is no value exactly 0.21 seconds after another value, so I'll put the 1 in the outputTics column two rows after: for example, at index 3 there is a 1 at 11.4 seconds, so I'm putting a 1 in the output column at 11.6 seconds.
If there is a 1 in the "inputTics" column 0.21 seconds or earlier, do not put a one in the output column: an example would be index 1 in the input column.
Here is an example of the red column I would like to create.
Here is the code to create the dataframe :
A = pd.DataFrame({"Timestamp":[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9,13.0],
"inputTics":[0,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,1,1],
"outputTics":[0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0]})
You can use pd.Timedelta, if you want, to avoid Python's rounded floating-point numbers.
Create the column with zeros.
df['outputTics'] = 0
Define a function set_output_tic in the following manner
def set_output_tic(row):
    if row['inputTics'] == 0:
        return 0
    # check for another 1 in inputTics within 0.11 seconds after this row
    t = row['Timestamp'] + 0.11  # Timestamps are floats here; pd.Timedelta would need real datetimes
    indices = df[(df.Timestamp > row['Timestamp']) & (df.Timestamp <= t)].index
    c = 0
    for i in indices:
        if df.loc[i, 'inputTics'] == 0:
            c = c + 1
        else:
            c = 0
            break
    if c > 0 and indices[-1] + 1 in df.index:
        df.loc[indices[-1] + 1, 'outputTics'] = 1
    return 0
then call the above function using df.apply
temp = df.apply(set_output_tic, axis = 1) # temp is practically useless
This was actually kind of tricky, but by playing with indices in numpy you can do it.
# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])
# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics']==1].index + 0.11
# Iterate through the indices and find the indices to update outputTics
output_indices = []
for ii in input_indices:
# Compare indices to full dataframe's timestamps
# and return index of nearest timestamp
oi = np.argmax((A.index - ii)>=0)
output_indices.append(oi)
# Create column of output ticks with 1s in the right place
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1
# Add it to dataframe
A['outputTics'] = output_tics
# Add condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']
# Clean up negative values
A.loc[A['outputTics'] < 0, 'outputTics'] = 0
# The first row becomes 1 because of indexing; change to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0
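A condensed, runnable sketch of the same nearest-timestamp idea, assuming the example frame from the question and the 0.11-second offset used above; the suppression rule is approximated by zeroing any output that lands on an input tick:

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({
    "Timestamp": [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0,
                  12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9, 13.0],
    "inputTics": [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
})

ts = A["Timestamp"].to_numpy()
out = np.zeros(len(A), dtype=int)
for t in ts[A["inputTics"].to_numpy() == 1]:
    # positions at least 0.11 s after this tick
    later = np.nonzero(ts >= t + 0.11)[0]
    if len(later):
        out[later[0]] = 1
# a row that is itself an input tick never carries an output tick
out[A["inputTics"].to_numpy() == 1] = 0
A["outputTics"] = out
```

Note this window logic does not fully reproduce the expected column from the question (it also marks 12.4, since the tick at 12.2 is not suppressed), which matches the behaviour of the index-based answer above.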
I have a dataframe with a few columns (one boolean and one numeric). I want to apply conditional formatting using pandas styling, since I am going to output my dataframe as HTML in an email, based on the following conditions: 1. the boolean column = Y, and 2. the numeric column > 0.
For example,
col1 col2
Y 15
N 0
Y 0
N 40
Y 20
In the example above, I want to highlight the first and last row since they meet those conditions.
Yes, there is a way. Use lambda expressions to apply the conditions and the dropna() function to exclude None/NaN values:
df["col2"] = df["col2"].apply(lambda x:x if x > 0 else None)
df["col1"] = df["col1"].apply(lambda x:x if x == 'Y' else None)
df = df.dropna()
I used the following and it worked:
def highlight_color(s):
    # return one style string per column in the row
    if s.Col2 > 0 and s.Col1 == "N":
        return ['background-color: red'] * len(s)
    else:
        return ['background-color: white'] * len(s)
df.style.apply(highlight_color, axis=1).render()
I am trying to populate a dataframe by looking up values in a list of lists and trying to find a match for column/index. I found this post and thought I could modify it for my needs. fill in entire dataframe cell by cell based on index AND column names?. He fills in his pandas dataframe using the edit_distance function. I am currently trying to modify that function so it outputs actual data.
My data set looks something like this but with many more values:
Data = [Product Number, Date, Quantity]
       [X1, 2018-01, 2]
       [X1, 2018-02, 4]
       [X1, 2018-03, 7]
       [X2, 2018-01, 3]
       [X3, 2018-02, 5]
       [X3, 2018-03, 6]
Expected Outcome: Apologies for crude representation
DF =  2018-01  2018-02  2018-03
X1          2        4        7
X2          3
X3                   5        6
I deduped all of the product numbers and dates in my list of lists and set them equal to the below, just as he did in the referenced stack question.
series_rows = pd.Series(prod_deduped)
series_cols = pd.Series(dates_deduped)
His code for mapping all of the cells:
df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
The part starting with edit_distance is a function that returns a value based on the inputs for x,y. I created my own function that would loop through a list of lists and return a value based on a match.
def return_value(s1, s2, list_of_lists, starting_point_in_case_of_header):
    result = ''
    for row in list_of_lists[starting_point_in_case_of_header:]:
        product = row[0]
        date = row[1]
        quantity = row[2]
        if product == s1 and date == s2:
            result = quantity
    return result
I get a match on row1, column1 but everything else is blank which makes me think that I really need to be looping through s1 or s2 before everything else. Any help would be appreciated. Thanks!
EDIT: Here is my most recent attempt at trying to loop through s1 and s2 but this just errors out saying my list index is out of range. I think I am on the right track though.
def return_value(s1, s2, list_of_lists, starting_point_in_case_of_header):
    for y in enumerate(s2):
        result_ = []
        for x in enumerate(s1):
            for row in list_of_lists[starting_point_in_case_of_header:]:
                product = row[0]
                date = row[1]
                quantity = row[2]
                if product == x and date == y:
                    result_.append(quantity)
        result = result_
    return result[-1]
My final code to put everything all together:
result_df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: return_value(x, y, sorted_deduped_list, 0))))
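For what it's worth, this kind of product-by-date table can also be built directly with pivot, with no custom lookup function at all. A sketch assuming the sample data above (column names Product/Date/Quantity are my labels), with missing combinations left as NaN:

```python
import pandas as pd

data = [['X1', '2018-01', 2], ['X1', '2018-02', 4], ['X1', '2018-03', 7],
        ['X2', '2018-01', 3], ['X3', '2018-02', 5], ['X3', '2018-03', 6]]
df = pd.DataFrame(data, columns=['Product', 'Date', 'Quantity'])

# rows become products, columns become dates, cells hold quantities
result = df.pivot(index='Product', columns='Date', values='Quantity')
```

pivot requires each Product/Date pair to be unique; if there can be duplicates, pivot_table with an aggregation function is the safer choice.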