I wrote the for loop below to find out how many times each ID repeats in the dataframe; now I need to create a column holding that total against the respective ID.
In short, I need a column with the repeat total of df['ID']. How do I index the totals from a groupby command?
test = df['ID'].sort_values(ascending=True).reset_index(drop=True)  # reset so positional test[k] works
rep = 0
for k in range(0, len(test) - 1):
    if test[k] == test[k + 1]:
        rep += 1
        if k == len(test) - 2:
            print(test[k], ',', rep + 1)
    else:
        print(test[k], ',', rep + 1)
        rep = 0
out:
> 7614381 , 1
> 349444 , 5
> 4577800 , 7
"For example, a column with the total number of times that given df['ID'] appeared in the dataframe"
Is this what you mean?
import pandas as pd

df = pd.DataFrame(
    {'id': [7614381, 349444, 349444, 4577800, 4577800, 349444, 4577800]}
)
df["id_count"] = df.groupby('id')['id'].transform('count')
df
Output
        id  id_count
0  7614381         1
1   349444         3
2   349444         3
3  4577800         3
4   349444         3
5  4577800         3
6  4577800         3
based on: https://stackoverflow.com/a/22391554/11323137
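If you'd rather avoid groupby entirely, here is a minimal alternative sketch using value_counts and map (same result as the transform above):

import pandas as pd

df = pd.DataFrame(
    {'id': [7614381, 349444, 349444, 4577800, 4577800, 349444, 4577800]}
)
# value_counts returns a Series indexed by id; map aligns each row's id to its count
df['id_count'] = df['id'].map(df['id'].value_counts())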
I have a pandas dataframe that looks like this:
I want to count how many rows there are for each id and print the result. The problem is that I want the count ONLY for consecutive numbers in "frame num".
For example: if frame num is [1,2,3,45,47,122,123,124,125] and id is [1,1,1,1,1,1,1,1,1], it should print: 3 1 1 4 (and do that for EACH id).
Is there any way to do that? I've been going crazy trying to figure it out! To count rows for each id, a GROUP BY should be enough, but with this new condition it's difficult.
You can use pandas.DataFrame.shift() to flag consecutive numbers, then itertools.groupby to build the list of consecutive counts.
import pandas as pd
from itertools import chain, groupby

# Example input dataframe
df = pd.DataFrame({
    'num': [1,2,3,45,47,122,123,124,125,1,2,3,45,47,122,123,124,125],
    'id':  [1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2]
})

# Flag rows that belong to a consecutive run (previous or next number is adjacent)
df['s'] = (df['num']-1 == df['num'].shift()) | (df['num']+1 == df['num'].shift(-1))

# Per id: runs of True collapse to their length, runs of False become individual 1s
res = df.groupby('id')['s'].apply(lambda g: list(chain.from_iterable(
    [len(list(group))] if key else [1]*len(list(group))
    for key, group in groupby(g))))
print(res)
Output:
id
1 [3, 1, 1, 4]
2 [3, 1, 1, 4]
Name: s, dtype: object
Update: Get the output as a dataframe:
>>> res.to_frame().explode('s').reset_index()
id s
0 1 3
1 1 1
2 1 1
3 1 4
4 2 3
5 2 1
6 2 1
7 2 4
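A pure-pandas alternative sketch (no itertools), assuming the same df as above: label each run of consecutive numbers with diff/cumsum, then take the group sizes.

# A new run starts wherever the within-id gap to the previous row is not 1;
# diff() is NaN at each id's first row, which also starts a new run.
run_id = df.groupby('id')['num'].diff().ne(1).cumsum()
res = df.groupby(['id', run_id]).size().groupby(level='id').apply(list)
print(res)
# id
# 1    [3, 1, 1, 4]
# 2    [3, 1, 1, 4]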
I have a dataset where I have to identify whether the sale value of a toy is greater than the average of its column, and count in how many different sale areas the value is greater than the average.
For example: find the average of column "SaleB" (2.5), check in how many rows the value is greater than 2.5, then perform the same exercise for "SaleA" and "SaleC", and add it all up.
input_data = pd.DataFrame({'Toy': ['A','B','C','D'],
                           'Color': ['Red','Green','Orange','Blue'],
                           'SaleA': [1,2,0,1],
                           'SaleB': [1,3,4,2],
                           'SaleC': [5,2,3,5]})
A new column "Count_Sale_Average" is created. For example, for toy "A" the sale was greater than the average at only one position.
output_data = pd.DataFrame({'Toy': ['A','B','C','D'],
                            'Color': ['Red','Green','Orange','Blue'],
                            'SaleA': [1,2,0,1],
                            'SaleB': [1,3,4,2],
                            'SaleC': [5,2,3,5],
                            'Count_Sale_Average': [1,2,1,1]})
My code works and gives the desired output. Any suggestions on other ways of doing it, perhaps more efficient and in fewer lines?
list_var = ['SaleA','SaleB','SaleC']
df = input_data[list_var].copy()  # copy to avoid SettingWithCopyWarning
for i in range(0, len(list_var)):
    var = list_var[i]
    mean_var = df[var].mean()
    df[var] = df[var].apply(lambda x: 1 if x > mean_var else 0)
df['Count_Sale_Average'] = df[list_var].sum(axis=1)
output_data = pd.concat([input_data, df[['Count_Sale_Average']]], axis=1)
output_data
You could filter the Sale columns, compare each against its column mean, and sum along the rows:
filtered = input_data.filter(like='Sale')
input_data['Count_Sale_Average'] = filtered.gt(filtered.mean()).sum(axis=1)
Output:
Toy Color SaleA SaleB SaleC Count_Sale_Average
0 A Red 1 1 5 1
1 B Green 2 3 2 2
2 C Orange 0 4 3 1
3 D Blue 1 2 5 1
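To make the comparison step explicit, here is a small sketch of the intermediates (values shown in the comments are for the example input_data):

means = filtered.mean()    # per-column means: SaleA 1.0, SaleB 2.5, SaleC 3.75
mask = filtered.gt(means)  # boolean frame; each cell is compared to its own column's mean
mask.sum(axis=1)           # row-wise count of True -> 1, 2, 1, 1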
I would like to make a new column with the order of the numbers in a list. I get 3,1,0,4,2,5 (the indices of the numbers from lowest to highest), but I would like a new column with 2,1,4,0,3,5, so that for each row I can see where that number falls in the ordering of the whole list. What am I doing wrong?
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
list order
0 4 2
1 3 1
2 6 4
3 1 0
4 5 3
5 9 5
You can use the method parameter to control how to resolve ties.
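For example, a small sketch with a hypothetical series containing a tie:

import pandas as pd

s = pd.Series([4, 3, 4, 1, 5])
print(s.rank(method='average'))  # default: tied 4s share 3.5 -> 3.5, 2.0, 3.5, 1.0, 5.0
print(s.rank(method='min'))      # ties take the lowest rank  -> 3.0, 2.0, 3.0, 1.0, 5.0
print(s.rank(method='dense'))    # no gaps after a tie group  -> 3.0, 2.0, 3.0, 1.0, 4.0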
What I have?
I have a dataframe like this:
id value
0 0 5
1 0 5
2 0 6
3 1 7
4 1 7
What I want to get?
I want to drop all rows whose id has more than one distinct value. In the example above, I want to drop all the rows with id = 0:
id value
3 1 7
4 1 7
What I have tried?
import pandas as pd
df = pd.DataFrame({'id':[0, 0, 0, 1, 1], 'value':[5,5,6,7,7]})
print(df)
id_list = df['id'].tolist()
id_set = set(id_list)
for id in id_set:
    temp_list = df.loc[df['id'] == id, 'value'].tolist()
    s = set(temp_list)
    if len(s) > 1:
        df = df.loc[df['id'] != id]
It works, but it's ugly and inefficient.
Is there a better, more Pythonic way using pandas methods?
Use GroupBy.transform with DataFrameGroupBy.nunique to get the number of unique values per group as a Series aligned with the original rows, then compare and filter with boolean indexing:
df = df[df.groupby('id')['value'].transform('nunique').eq(1)]
print(df)
id value
3 1 7
4 1 7
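An alternative sketch using GroupBy.filter, which keeps only the groups satisfying a predicate; it is usually slower than transform on large frames, but arguably more readable:

df = df.groupby('id').filter(lambda g: g['value'].nunique() == 1)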
Try this code:
import pandas as pd

id1 = pd.Series([0,0,0,1,1])
value = pd.Series([5,5,6,7,7])
data = pd.DataFrame({'id': id1, 'value': value})
datag = data.groupby('id')

# collect the indices of rows whose id has more than one distinct value, then drop them
datadel = []
for i in set(data.id):
    if len(set(datag.get_group(i)['value'])) != 1:
        datadel.extend(data.loc[data["id"] == i].index.tolist())
data.drop(datadel, inplace=True)
print(data)
This is my pandas DataFrame with original column names.
old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
            1           3               0       0
            2           1               1       5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column for each unique cm. In this example there should be 2 new columns.
Finally, each new column should store the row-wise count of non-zero values across the original columns containing that cm token, i.e.
old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
            1           3               0       0    2    0
            2           1               1       5    2    1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns
How do I proceed with steps 2 and 3, so that the solution is automatic? (I don't want to define the column names cm1 and cm2 manually, because the original data set might have many cm variations.)
You can use:
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
0              1           3               0       0
1              2           1               1       5
First, filter the columns whose names contain the string cm, so that columns without cm are removed:
df1 = df.filter(regex='cm')
Now rename the columns to their cm tokens (cm1, cm2, ...):
print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
   cm1  cm1  cm2
0    1    3    0
1    2    1    1
Now count the non-zero values: convert df1 to a boolean DataFrame and sum, since True converts to 1 and False to 0. You need the count per unique column name, so group by the column names and sum the values.
df1 = df1.astype(bool)
print(df1)
    cm1   cm1    cm2
0  True  True  False
1  True  True   True
print(df1.groupby(df1.columns, axis=1).sum())
   cm1  cm2
0    2    0
1    2    1
You need the unique column names, which are then added to the original df:
print(df1.columns.unique())
['cm1' 'cm2']
Last, add the new columns (here df[['cm1','cm2']]) from the groupby result:
df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
0              1           3               0       0    2    0
1              2           1               1       5    2    1
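Note that groupby(..., axis=1) is deprecated in recent pandas versions. A hedged equivalent sketch: transpose, group on the index, and transpose back.

# df1 here is the boolean frame built above
counts = df1.T.groupby(level=0).sum().T
df[counts.columns] = counts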
Once you know which columns have cm in them, you can map them (with a dict) to the desired new column, with an adapted version of this answer:
col_map = {c: 'cm' + c[c.index('cm') + len('cm')] for c in ind}
# ^ if you are hard-coding this in, you might as well use 2 instead of len('cm')
so that each column maps to 'cm' plus the character that directly follows it; in this case the mapping would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col, new_col in col_map.items():
    if new_col not in df:
        df[new_col] = [int(a != 0) for a in df[col]]
    else:
        df[new_col] += [int(a != 0) for a in df[col]]
Note that int(a != 0) simply gives 0 if the value is 0 and 1 otherwise. The only issue is that dicts are unordered in older Python versions (before 3.7), so it may be preferable to add the new columns in order according to the values (like the answer here):
import operator

for col, new_col in sorted(col_map.items(), key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col] += [int(a != 0) for a in df[col]]
    else:
        df[new_col] = [int(a != 0) for a in df[col]]
to ensure the new columns are inserted in order.
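A vectorized sketch of the same idea, assuming col_map from above: build the 0/1 indicator frame once, then sum the source columns that share a target name.

# Indicator frame: 1 where the original cm columns are non-zero, else 0
indicator = (df[list(col_map)] != 0).astype(int)
for new_col in sorted(set(col_map.values())):
    source_cols = [c for c, n in col_map.items() if n == new_col]
    df[new_col] = indicator[source_cols].sum(axis=1)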