Create variables with conditional logic from dataframe - python

I have a dataframe with a column called 'success' (amongst others). In this column, we have only 0 and 1 values. Now, I want to count how many times each value occurs.
I tried this command: sdf.groupby('success').sum() but it only gives me a single table with both counts in one view.
Since I need to do math on the individual frequencies of 0 and 1, I need them in two separate variables. Example:
col1=6100
col2=5878
c=col1/(col1+col2)
How can I do this?

You can use value_counts to count how many times each value in a column occurs. Then you could turn the resulting series into a dataframe, and transpose it to get the values as column headers.
counts = pd.DataFrame(sdf['success'].value_counts()).transpose()
Let me know if this works for you.
To do your calculation, you can then apply a lambda function to the resulting dataframe (which I named counts). row[0] will access your count of 0s in success, since the previous code produced a column labelled 0.
counts['result'] = counts.apply(lambda row: row[0]/(row[0] + row[1]), axis=1)
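If you prefer the two counts as plain scalars, as the question's example suggests, here is a minimal, self-contained sketch of an alternative (the sample data below is made up for illustration):
import pandas as pd

sdf = pd.DataFrame({'success': [0, 1, 1, 0, 1, 1]})  # made-up data

vc = sdf['success'].value_counts()    # Series indexed by the values 0 and 1
zeros, ones = vc.get(0, 0), vc.get(1, 0)
c = ones / (zeros + ones)             # the ratio from the question
print(zeros, ones, c)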

Related

pandas: return mutated column into original dataframe

I've attempted to search the forum for this question, but I believe I may not be asking it correctly, so here it goes.
I have a large data set with many columns. Originally, I needed to sum all columns for each row by multiple groups based on a name pattern of variables. I was able to do so via:
cols = data.filter(regex=r'_name$').columns
data['sum'] = data.groupby(['id','group'],as_index=False)[cols].sum().assign(sum = lambda x: x.sum(axis=1))
By running this code, I receive a modified dataframe grouped by my 2 factor variables (group & id), with all the columns and the final sum column I need. However, I now want to return that final sum column back into the original dataframe. The above code returns the entire modified dataframe into my sum column. I know this is achievable in R by simply adding .$sum at the end of a pipe. Any ideas on how to get this in pandas?
My hoped-for output is just the addition of the final "sum" variable from the above lines of code into my original dataframe.
Edit: To clarify, the code above returns the entire grouped dataframe (screenshot omitted); all I want returned is the highlighted sum column.
Is this what you need?
data['sum'] = data.groupby(['id','group'])[cols].transform('sum').sum(axis = 1)
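Here is a minimal sketch of why this works, with made-up data (the _name columns and the id/group values are illustrative only). transform('sum') keeps the original index, so the per-group sums align row by row with the source frame:
import pandas as pd

data = pd.DataFrame({
    'id':     [1, 1, 2, 2],
    'group':  ['a', 'a', 'a', 'b'],
    'x_name': [1, 2, 3, 4],
    'y_name': [10, 20, 30, 40],
})
cols = data.filter(regex=r'_name$').columns

# per-group column sums, aligned to the original rows, then summed across columns
data['sum'] = data.groupby(['id', 'group'])[cols].transform('sum').sum(axis=1)
print(data)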

Python Pandas using count, drop_duplicates to get the difference of pre-duplicate dropped column count

I have a script to pull a CSV file with ~3 million rows of data, narrowing down the columns as I read and save it. The data in col1 is IP addresses, and I am counting it as follows:
print('total count for: ' + str(df['col1'].count()))
Then I use the line below to drop duplicates from this same column:
print(df.col1.duplicated(keep="first").count())
I am now attempting to find the difference between the two. The .count() before deduplication gives 2368, and after the duplicated step I get 2349; I want to compute the difference between 2368 and 2349 and print it.
I have tried multiple variations of .count - .duplicated with no luck. How can I do this?
The .duplicated() method just returns a boolean Series of the same length as the original column, with True in rows whose values are duplicates. Running count() on that Series will still produce the same number of rows as the original count.
You could replace it with the drop_duplicates(keep='first') method to actually remove the duplicates:
print(df['col1'].count() - df['col1'].drop_duplicates(keep='first').count())
To simplify, you could just run this expression to get the same value; it sums up the True values, giving you the number of rows with duplicate values:
print(df.col1.duplicated(keep="first").sum())
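A small made-up example to show that both approaches agree (the IP addresses are illustrative only):
import pandas as pd

df = pd.DataFrame({'col1': ['1.1.1.1', '2.2.2.2', '1.1.1.1', '3.3.3.3', '2.2.2.2']})

# total rows minus distinct rows ...
print(df['col1'].count() - df['col1'].drop_duplicates(keep='first').count())  # 2

# ... equals the number of True values in the duplicated() mask
print(df.col1.duplicated(keep='first').sum())  # 2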

Python (pandas) loop through values in a column, do a calc with each value

I have a data set of dB values in a dataframe and want to do a calc for each row in a specific column. I've tried this:
for i in dataAnti['antilog']:
    x = 10**(i/10)
It gives me the correct value but only loops once. How do I save these new values in a new column or save over the values in the antilog column?
You need to define the new column and simply write out the calculation you want.
dataAnti['new_column'] = 10**(dataAnti['antilog']/10)
This will automatically take the value of each row, perform the calculation, and assign the result to the same row in new_column.
You can make use of the apply method.
dataAnti['result']=dataAnti['antilog'].apply(lambda i: 10**(i/10))
You can pass any function to apply() that takes a single value; it is applied element-wise and the results form the new column.
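A minimal sketch comparing the two answers (the dB values are made up); the vectorized form is generally the faster of the two:
import pandas as pd

dataAnti = pd.DataFrame({'antilog': [10.0, 20.0, 30.0]})  # made-up dB values

dataAnti['new_column'] = 10**(dataAnti['antilog']/10)                 # vectorized
dataAnti['result'] = dataAnti['antilog'].apply(lambda i: 10**(i/10))  # apply()
print(dataAnti)  # both new columns hold the same values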

Sum count of unique value counts of all series in a Pandas dataframe

I am at my wit's end as I am writing this. This is probably an incredibly small issue, but I've not been able to get around it. Here's what is going on:
1. I have a dataframe df with 80 columns.
2. Performing value_counts().count() over df iteratively, I am able to print the column names and the number of unique values in each column.
3. Here's the problem: what I also want to do is sum up the count() of unique values of all columns. Essentially I need just one number. So basically, if column1 had 10 uniques, column2 had 5, and column3 had 3, I am expecting the sum() to be 18.
About #2, here's what works (a simple for loop):
def counting_unique_values_in_df(df):
    for evry_colm in df:
        print(evry_colm, "-", df[evry_colm].value_counts().count())
That works; it prints in this format: the column - unique values.
Now, alongside that, I'd like to print the sum of the unique values. Whatever I tried either prints the unique count of the last column (which is incidentally 2) or prints something random. I know it's something to do with the for loop, but I can't seem to figure out what.
I also know that in order to get what I want, which is essentially sum(df[evry_colm].value_counts().count()), I will need to convert df[evry_colm].value_counts().count() to a series, or even a dataframe, but I am stuck with that too!
Thanks in advance for your help.
You could use nunique, which returns a Series of unique-value counts across all your columns, which you can then sum:
df.nunique().sum()
My first instinct was to do it by series with a list comprehension
sum([df[col].nunique() for col in list(df)])
but this is slower and less Pandorable!
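A tiny made-up frame to illustrate: 3 + 2 + 2 = 7 uniques in total.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'x', 'y'], 'c': [True, False, True]})

print(df.nunique())        # uniques per column: a 3, b 2, c 2
print(df.nunique().sum())  # 7, the single number the question asks for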

Cleaning Data: Replacing Current Column Values with Values mapped in Dictionary

I have been trying to wrap my head around this for a while now and have yet to come up with a solution.
My question is: how do I change current column values in multiple columns, based on the column name, if a criterion is met?
I have survey data which has been read in as a pandas csv dataframe:
import pandas as pd
df = pd.read_csv("survey_data")
I have created a dictionary with column names and the values I want in each column if the current column value is equal to 1. Each column contains 1 or NaN. Basically, any column within the data frame ending in '_SA' becomes 5, '_A' becomes 4, '_NO' becomes 3, '_D' becomes 2, and '_SD' stays as the current value 1. All of the 'NaN' values remain as is. This is the dictionary:
op_dict = {
'op_dog_SA':5,
'op_dog_A':4,
'op_dog_NO':3,
'op_dog_D':2,
'op_dog_SD':1,
'op_cat_SA':5,
'op_cat_A':4,
'op_cat_NO':3,
'op_cat_D':2,
'op_cat_SD':1,
'op_fish_SA':5,
'op_fish_A':4,
'op_fish_NO':3,
'op_fish_D':2,
'op_fish__SD':1}
I have also created a list, op_cols, of the columns within the data frame I would like to be changed if the current column value = 1. Now I have been trying to use something like this that iterates through the values in those columns and replaces 1 with the mapped value in the dictionary:
for i in df[op_cols]:
    if i == 1:
        df[op_cols].apply(lambda x: op_dict.get(x,x))
df[op_cols]
It is not spitting out an error, but it is not replacing the 1 values with the corresponding values from the dictionary; they remain as 1.
Any advice/suggestions on why this would not work, or on a more efficient way, would be greatly appreciated.
So, if I understand your question, you want to replace all the 1s in a column with 1, 2, 3, 4, or 5 depending on the column name?
I think all you need to do is iterate through your list and multiply by the value your dict returns:
for col in op_cols:
    df[col] = df[col]*op_dict[col]
This does what you describe and is far faster than replacing every value. NaNs will still be NaNs; you could handle those in the loop with fillna if you like, too.
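A minimal sketch with made-up survey columns (a two-column subset of op_cols; the values are illustrative only):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'op_dog_SA': [1, np.nan, 1],
    'op_dog_A':  [np.nan, 1, np.nan],
})
op_dict = {'op_dog_SA': 5, 'op_dog_A': 4}
op_cols = list(op_dict)

for col in op_cols:
    df[col] = df[col]*op_dict[col]  # 1 becomes the mapped value; NaN stays NaN

print(df)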
