dataframe group by for all columns in new dataframe - python

I want to create a new dataframe with the values grouped by each column header of the dataset I'm working with.
I essentially want a new dataframe which sums the occurrences of 1 and 0 for each feature (chocolate, fruity, etc.).
I tried this code with groupby and size:
chocolate = data.groupby(["chocolate"]).size()
bar = data.groupby(["bar"]).size()
hard = data.groupby(["hard"]).size()
display(chocolate, bar, hard)
but this only gives me the counts for each feature separately, not one combined dataframe.
This is the end result I want to end up with.

You could try the following:
res = (
    data
    .drop(columns="competitorname")
    .melt().value_counts()
    .unstack()
    .fillna(0).astype("int").T
)
Eliminate the columns that aren't relevant (I've only seen competitorname, but there could be more).
.melt the dataframe. The result has 2 columns, one with the column names, and another with the respective 0/1 values.
Now .value_counts gives you a series that essentially contains what you are looking for.
Then you just have to .unstack the last index level (the 0/1 values) and transpose the dataframe so those values end up as the index.
Example:
data = pd.DataFrame({
    "competitorname": ["A", "B", "C"],
    "chocolate": [1, 0, 0], "bar": [1, 0, 1], "hard": [1, 1, 1]
})
competitorname chocolate bar hard
0 A 1 1 1
1 B 0 0 1
2 C 0 1 1
Result:
variable bar chocolate hard
value
0 1 2 0
1 2 1 3
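To make the chain easier to follow, here is a small annotated sketch of the intermediate steps on the example data above (the exact row order of value_counts may vary between pandas versions, and the variable names are purely illustrative):
import pandas as pd

data = pd.DataFrame({
    "competitorname": ["A", "B", "C"],
    "chocolate": [1, 0, 0], "bar": [1, 0, 1], "hard": [1, 1, 1]
})

# one row per original cell: a 'variable' column (feature name) and a 'value' column (0/1)
melted = data.drop(columns="competitorname").melt()

# Series indexed by (variable, value) pairs, e.g. ('hard', 1) -> 3, ('bar', 1) -> 2, ...
counts = melted.value_counts()

# move the 0/1 level into the columns, fill missing combinations with 0, transpose
res = counts.unstack().fillna(0).astype("int").T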
Alternative with .pivot_table:
res = (
    data
    .drop(columns="competitorname")
    .melt().value_counts().to_frame()
    .pivot_table(index="value", columns="variable", fill_value=0)
    .droplevel(0, axis=1)
)
PS: Please don't post images; provide a little example (like here) that encapsulates your problem.

Select pandas Series elements based on condition

Given a dataframe, I know I can select rows by condition using below syntax:
df[df['colname'] == 'Target Value']
But what about a Series? Series does not have a column (axis 1) name, right?
My scenario is that I have created a Series through the nunique() function:
sr = df.nunique()
And I want to list out the index names of those rows with value 1.
Having failed to find a clear answer on the Net, I resorted to the solution below:
for (colname, coldata) in sr.iteritems():
    if coldata == 1:
        print(colname)
Question: what is a better way to get my answer (i.e. list out the index names of the Series, or the column names of the original DataFrame, which have just a single value)?
The ultimate objective was to find which columns in a DF have one and only one unique value. Since I did not know how to do that directly from a DF, I first used nunique(), which gave me a Series. Thus I needed to process the Series with a "== 1" (i.e. one and only one).
I hope my question isn't silly.
It is unclear whether you want to work on the DataFrame or on the Series.
Case 1: Working on DataFrame
In case you want to work on the dataframe to list out the index names of those rows with value 1, you can try:
df.index[df[df==1].any(axis=1)].tolist()
Demo
data = {'Col1': [0, 1, 2, 2, 0], 'Col2': [0, 2, 2, 1, 2], 'Col3': [0, 0, 0, 0, 1]}
df = pd.DataFrame(data)
Col1 Col2 Col3
0 0 0 0
1 1 2 0
2 2 2 0
3 2 1 0
4 0 2 1
Then, run the code, it gives:
[1, 3, 4]
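(A slightly more direct equivalent, if I am reading the intent correctly, skips the intermediate masked frame; the line below is just a sketch of the same idea:)
df.index[df.eq(1).any(axis=1)].tolist()  # rows containing at least one 1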
Case 2: Working on Series
If you want to extract the index of a Series with value 1, you can extract it into a list, as follows:
sr.loc[sr == 1].index.tolist()
or use:
sr.index[sr == 1].tolist()
It would work the same way, due to the fact that pandas overloads the == operator:
selected_series = series[series == my_value]
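For the ultimate objective stated in the question (which columns of the original DataFrame contain exactly one unique value), a minimal sketch that skips the intermediate Series variable could look like this (the example frame is purely illustrative):
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1],       # one unique value
    "b": [1, 2, 3],       # three unique values
    "c": ["x", "x", "x"]  # one unique value
})

# nunique() counts distinct values per column; boolean indexing keeps those equal to 1
single_valued = df.columns[df.nunique() == 1].tolist()
print(single_valued)  # ['a', 'c']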

python / pandas: How to count each cluster of unevenly distributed distinct values in each row

I am transitioning from Excel to Python and finding the process a little daunting. I have a pandas dataframe and cannot find how to count the total of each cluster of 1s per row, grouped by each ID (example data below).
ID 20-21 19-20 18-19 17-18 16-17 15-16 14-15 13-14 12-13 11-12
0 335344 0 0 1 1 1 0 0 0 0 0
1 358213 1 1 0 1 1 1 1 0 1 0
2 358249 0 0 0 0 0 0 0 0 0 0
3 365663 0 0 0 1 1 1 1 1 0 0
The result of the above, in the format
ID
last column heading where a 1 occurs: count of 1s in that cluster
would be:
335344
16-17: 3
358213
19-20: 2
14-15: 4
12-13: 1
365663
13-14: 5
There are more than 11,000 rows of data, and I would like to output the result to a txt file. I have been unable to find any examples of how the same values are clustered by row, with a count for each cluster, but I am probably not using the correct Python terminology. I would be grateful if someone could point me in the right direction. Thanks in advance.
The first step is to reshape with DataFrame.set_index plus DataFrame.stack. Then create consecutive groups by comparing against the Series.shift-ed values for inequality and taking the cumulative sum with Series.cumsum into a new column g. Then filter the rows equal to 1 and aggregate with named aggregation in GroupBy.agg, using GroupBy.last and GroupBy.size:
df = df.set_index('ID').stack().reset_index(name='value')
df['g'] = df['value'].ne(df['value'].shift()).cumsum()
df1 = (df[df['value'].eq(1)].groupby(['ID', 'g'])
       .agg(a=('level_1','last'), b=('level_1','size'))
       .reset_index(level=1, drop=True)
       .reset_index())
print(df1)
ID a b
0 335344 16-17 3
1 358213 19-20 2
2 358213 14-15 4
3 358213 12-13 1
4 365663 13-14 5
Last, to write to a txt file use DataFrame.to_csv:
df1.to_csv('file.txt', index=False)
If you need your custom format in the text file, use:
with open("file.txt", "w") as f:
    for i, g in df1.groupby('ID'):
        f.write(f"{i}\n")
        for a, b in g[['a','b']].to_numpy():
            f.write(f"\t{a}: {b}\n")
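With the df1 shown above, file.txt should then contain (the indented lines are tab-prefixed):
335344
    16-17: 3
358213
    19-20: 2
    14-15: 4
    12-13: 1
365663
    13-14: 5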
You just need to use the sum method and then specify which axis you would like to sum on. To get the sum of each row, create a new column equal to the sum across that row.
# create a new column equal to the sum of the values in each row
df['sum'] = df.sum(axis=1)  # axis=1 sums across the columns, giving one value per row
The best method for getting the sum of each column is dependent on how you want to use that information but in general the core is just to use the sum method on the series and assign it to a variable.
# sum a column and assign result to variable
foo = df['20-21'].sum() # default axis=0
bar = df['16-17'].sum() # default axis=0
print(foo) # returns 1
print(bar) # returns 3
You can get the sum of each column using a for loop and add the results to a dictionary. Here is a quick function I put together that should get the sum of each column and return a dictionary of the results, so you know which total belongs to which column. The two inputs are 1) the dataframe and 2) a list of any column names you would like to ignore.
def get_df_col_sum(frame: pd.DataFrame, ignore: list) -> dict:
    """Get the sum of each column in a dataframe as a dictionary"""
    # get list of headers in dataframe
    dfcols = frame.columns.tolist()
    # create a blank dictionary to store results
    dfsums = {}
    # loop through each column and add its sum to the dictionary
    for dfcol in dfcols:
        if dfcol not in ignore:
            dfsums.update({dfcol: frame[dfcol].sum()})
    return dfsums
I then ran the following code
# read excel to dataframe
df = pd.read_excel(test_file)
# ignore the ID column
ignore_list = ['ID']
# get sum for each column
res_dict = get_df_col_sum(df, ignore_list)
print(res_dict)
and got the following result.
{'20-21': 1, '19-20': 1, '18-19': 1, '17-18': 3, '16-17': 3, '15-16':
2, '14-15': 2, '13-14': 1, '12-13': 1, '11-12': 0}
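For completeness, a minimal vectorized sketch (assuming the ID column is the only one to exclude) gives the same dictionary without the explicit loop:
# sum every remaining column at once and convert the resulting Series to a dict
res_dict = df.drop(columns=['ID']).sum().to_dict()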
Sources: Sum by row, Pandas Sum, Add pairs to dictionary

How to concat multiple dummy variables to dataframe?

I am kind of stuck with a silly issue; can someone please help me by pointing out my mistake?
So, I have 5 categorical variables. I have created their dummies in individual data frames.
seasons = pd.get_dummies(bike['season'], drop_first=True) #3
weathers = pd.get_dummies(bike['weather'], drop_first=True) #3
days = pd.get_dummies(bike['weekday'], drop_first=True)# 6
months = pd.get_dummies(bike['month'], drop_first=True) # 11
years = pd.get_dummies(bike['yr'], drop_first=True) #1
#will add 24 new columns.
Now, when I try to concat them into my main df:
bike = pd.concat([bike, seasons], axis=1)
bike = pd.concat([bike, weathers], axis=1)
bike = pd.concat([bike, months], axis=1)
bike = pd.concat([bike, days], axis=1)
bike = pd.concat([bike, years], axis=1)
bike.info()
I am getting a KeyError: 0 error on bike.info().
Upon investigating, I found it only occurs when I concat the year df, which originally indicates one of two years (2018: 0, 2019: 1). After the dummy is created, this is how it looks:
2019
0 0
1 0
2 0
3 0
4 0
Please Suggest.
Thanks
First of all, do you know why you are using drop_first=True? Just checking whether this is what you want (removing the first level and keeping only k-1 categorical levels).
If you want to keep all original data that was not processed by the get_dummies method, you do not need to use the concat function; it is enough to do bike_with_dummies = pd.get_dummies(bike, columns=['season','weather','weekday','month','yr'], drop_first=True). See example 1. If you also want to keep the original categorical columns themselves, I would recommend using the code in example 2.
Example 1
You have for example this simple DataFrame (taken from pandas doc)
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
When you run
pd.get_dummies(df, columns=['C'], drop_first=True)
it will keep the original columns ("A" and "B") and will convert selected columns ("C" here) to dummies. Output will look like
A B C_2 C_3
0 a b 0 0
1 b a 1 0
2 a c 0 1
Example 2
If you want to keep the original columns as well ("C" from the example above) I would recommend you to do the following
cols_to_dummies = ["C"] # columns that should be turned into dummies
df_with_dummies = pd.get_dummies(df, columns=cols_to_dummies, drop_first=True)
df_with_dummies_and_original = pd.concat([df[cols_to_dummies], df_with_dummies], axis=1)
The output will look like (note that "C" is included now)
C A B C_2 C_3
0 1 a b 0 0
1 2 b a 1 0
2 3 a c 0 1
So in your case you could run this
cols_to_dummies = ['season','weather','weekday','month','yr']
bike_with_dummies = pd.get_dummies(bike, columns=cols_to_dummies, drop_first=True)
bike_with_dummies_and_original = pd.concat([bike[cols_to_dummies], bike_with_dummies], axis=1)
This approach has the advantage that you can easily change cols_to_dummies to update the list of columns that should be turned into dummies, without adding any extra lines of code.
Final comments - if you prefer better naming, you can use the prefix and prefix_sep parameters or do the renaming yourself at the end.
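As a small sketch of that renaming option (the prefix chosen here is purely illustrative):
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})

# prefix sets the text placed before each dummy level,
# prefix_sep sets the separator between prefix and level
dummies = pd.get_dummies(df, columns=['C'], drop_first=True,
                         prefix='cat', prefix_sep='_')
print(dummies.columns.tolist())  # ['A', 'B', 'cat_2', 'cat_3']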
If this does not help you, please provide an example DataFrame (the content of the bike dataframe).

Calculating overlap between categorical variables in the same DataFrame in Python

I am working on my undergraduate thesis and I have a question about part of my code. I am trying to calculate the number of times each column pair in a df has the same value. The columns are all binary (0, 1).
The input has this format:
df = pd.DataFrame({"col1": [1, 0, 1], "col2": [0, 0, 1]})
For instance, the number of times col1 and col2 had the same value in the snippet above is 3.
This is the code I have so far:
bl1 = []
bl2 = []
overlap = []
for i in df.iterrows():
    for j in range(len(df.columns)):
        for k in range(j):
            a = df.iloc[j]
            b = df.iloc[k]
            comparison_column = np.where(a == b)
            bl1.append(df.columns[j])
            bl2.append(df.columns[k])
            overlap.append(len(comparison_column[0]))
After combining the lists into a pd.DataFrame, the output looks like this:
Base Learner 1 Base Learner 2 overlap
col1 col2 2
I know that the code does not work because I did a count in Excel and got different results for the overlap count. I suspect that the loops fail to sum the number of times the pair was found across the df.iterrows() iterations, but I do not know how to fix it. Please give me any suggestions you can. Thanks.
Let's try the magic matrix multiplication:
(df.T @ df) + ((1 - df.T) @ (1 - df))
Output:
Rule21 Rule22 Rule23 Rule24
Rule21 5 5 4 2
Rule22 5 5 4 2
Rule23 4 4 5 3
Rule24 2 2 3 5
Explanation: df.T @ df counts, for each pair of columns, the rows where the corresponding cells in both columns are 1. Similarly, (1 - df.T) @ (1 - df) counts the rows where both are 0.
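If you then want the pairwise long format from the question (Base Learner 1 / Base Learner 2 / overlap), a minimal sketch building on that matrix could look like this (the example frame and column names are only illustrative):
import pandas as pd

df = pd.DataFrame({"col1": [1, 0, 1], "col2": [0, 0, 1], "col3": [1, 1, 1]})

# pairwise count of rows where both columns are 1, plus rows where both are 0
overlap_matrix = (df.T @ df) + ((1 - df.T) @ (1 - df))

# reshape the square matrix into one row per column pair
pairs = (
    overlap_matrix.stack()
    .rename_axis(["Base Learner 1", "Base Learner 2"])
    .reset_index(name="overlap")
)
# keep each unordered pair once and drop self-comparisons
pairs = pairs[pairs["Base Learner 1"] < pairs["Base Learner 2"]]
print(pairs)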

How to let only 4 values after dot in all values of a column?

I'm trying to merge two dataframes with common columns similar to the example:
Df 1 : Column A | Column B
234.345564 43.234338
234.345882 23.454138
212.348762 98.454387
123.349834 43.452338
Df 2 : Column A | Column B
234.345564123 43.2343384313
234.345882543 23.4541383413
212.348762113 98.4543872343
123.349834458 43.4523383414
But as you can see, the uncertainty in one is bigger than in the other. So I'm trying to keep only the first 4 digits after the decimal point in both DataFrames' columns, so that when I merge them more values will match.
They would then be like this:
Df 1 : Column A | Column B
234.3455 43.2343
234.3458 23.4541
212.3487 98.4543
123.3498 43.4523
Df 2 : Column A | Column B
234.3455 43.2343
234.3458 23.4541
212.3487 98.4543
123.3498 43.4523
I've worked with the functions I'm listing below before, but I don't know how to apply them here.
.append(pl[0:pl.find('.')])
[f'{i}-{int(j)}' for i, j in map(lambda x: x.split('-'), segunda_lista)]
How can I achieve that?
If your data are numeric, you can do:
df2 = df2.apply(pd.Series.round, decimals=4)
Or if your data is text:
df2 = df2.apply(lambda x: x.str.extract(r'^(\d*(\.\d{,4})?)')[0])
Output:
ColA ColB
0 234.3456 43.2343
1 234.3459 23.4541
2 212.3488 98.4544
3 123.3498 43.4523
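Note that rounding and truncation can differ in the last digit (234.345564 rounds to 234.3456 but truncates to 234.3455). If you literally need to keep only the first 4 digits after the decimal point, a minimal sketch (assuming both frames are numeric) would be:
import numpy as np

# scale up, drop the remaining fractional part without rounding, scale back
df1 = np.trunc(df1 * 10**4) / 10**4
df2 = np.trunc(df2 * 10**4) / 10**4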
