I wrote the for loop below to find out how many times each ID repeats in the dataframe; now I need to create a column holding that total against the respective ID.
In short, I need a column with the repeat total of df['ID']. How do I index the totals from a groupby command?
test = df['ID'].sort_values(ascending=True).reset_index(drop=True)  # reset so positional test[k] works
rep = 0
for k in range(0, len(test) - 1):
    if test[k] == test[k + 1]:
        rep += 1
        if k == len(test) - 2:
            print(test[k], ',', rep + 1)
    else:
        print(test[k], ',', rep + 1)
        rep = 0
out:
> 7614381 , 1
> 349444 , 5
> 4577800 , 7
"For example, a column with the total number of times that given df['ID'] appeared in the dataframe"
Is this what you mean?
import pandas as pd

df = pd.DataFrame(
    {'id': [7614381, 349444, 349444, 4577800, 4577800, 349444, 4577800]}
)
df["id_count"] = df.groupby('id')['id'].transform('count')
df
Output
        id  id_count
0  7614381         1
1   349444         3
2   349444         3
3  4577800         3
4   349444         3
5  4577800         3
6  4577800         3
based on: https://stackoverflow.com/a/22391554/11323137
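If you'd rather avoid groupby entirely, here is a minimal alternative sketch using value_counts and map (same result as the transform above):

import pandas as pd

df = pd.DataFrame(
    {'id': [7614381, 349444, 349444, 4577800, 4577800, 349444, 4577800]}
)
# value_counts returns a Series indexed by id; map aligns each row's id to its count
df['id_count'] = df['id'].map(df['id'].value_counts())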
I have a pandas dataframe that looks like this:
I want to count how many rows there are for each id and print the result. The problem is that I want the count ONLY for consecutive numbers in "frame num".
For example: if frame num is [1,2,3,45,47,122,123,124,125] and id is [1,1,1,1,1,1,1,1,1], it should print: 3 1 1 4 (and do that for EACH id).
Is there any way to do that? I've been going crazy trying to figure it out! To count rows for each id, a GROUP BY should be enough, but with this new condition it's difficult.
You can use pandas.DataFrame.shift() to flag consecutive numbers, then itertools.groupby to build the list of consecutive counts.
import pandas as pd
from itertools import chain, groupby

# Example input dataframe
df = pd.DataFrame({
    'num': [1,2,3,45,47,122,123,124,125,1,2,3,45,47,122,123,124,125],
    'id':  [1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2]
})

# Flag rows that belong to a consecutive run (previous or next number is adjacent)
df['s'] = (df['num']-1 == df['num'].shift()) | (df['num']+1 == df['num'].shift(-1))

# Per id: runs of True collapse to their length, runs of False become individual 1s
res = df.groupby('id')['s'].apply(lambda g: list(chain.from_iterable(
    [len(list(group))] if key else [1]*len(list(group))
    for key, group in groupby(g))))
print(res)
Output:
id
1 [3, 1, 1, 4]
2 [3, 1, 1, 4]
Name: s, dtype: object
Update: Get the output as a dataframe:
>>> res.to_frame().explode('s').reset_index()
id s
0 1 3
1 1 1
2 1 1
3 1 4
4 2 3
5 2 1
6 2 1
7 2 4
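A pure-pandas alternative sketch (no itertools), assuming the same df as above: label each run of consecutive numbers with diff/cumsum, then take the group sizes.

# A new run starts wherever the within-id gap to the previous row is not 1;
# diff() is NaN at each id's first row, which also starts a new run.
run_id = df.groupby('id')['num'].diff().ne(1).cumsum()
res = df.groupby(['id', run_id]).size().groupby(level='id').apply(list)
print(res)
# id
# 1    [3, 1, 1, 4]
# 2    [3, 1, 1, 4]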
I have a dataset where I have to identify whether the sale value of a toy is greater than the average of its column, and count in how many different sale areas the value is greater than the average.
For example: find the average of column "SaleB" (2.5), check in how many rows the value is greater than 2.5, then perform the same exercise for "SaleA" and "SaleC", and add it all up.
input_data = pd.DataFrame({'Toy': ['A','B','C','D'],
                           'Color': ['Red','Green','Orange','Blue'],
                           'SaleA': [1,2,0,1],
                           'SaleB': [1,3,4,2],
                           'SaleC': [5,2,3,5]})
A new column "Count_Sale_Average" is created. For example, for toy "A" the sale was greater than the average at only one position.
output_data = pd.DataFrame({'Toy': ['A','B','C','D'],
                            'Color': ['Red','Green','Orange','Blue'],
                            'SaleA': [1,2,0,1],
                            'SaleB': [1,3,4,2],
                            'SaleC': [5,2,3,5],
                            'Count_Sale_Average': [1,2,1,1]})
My code works and gives the desired output. Any suggestions on other ways of doing it, perhaps more efficient and in fewer lines?
list_var = ['SaleA','SaleB','SaleC']
df = input_data[list_var].copy()  # copy to avoid SettingWithCopyWarning
for i in range(0, len(list_var)):
    var = list_var[i]
    mean_var = df[var].mean()
    df[var] = df[var].apply(lambda x: 1 if x > mean_var else 0)
df['Count_Sale_Average'] = df[list_var].sum(axis=1)
output_data = pd.concat([input_data, df[['Count_Sale_Average']]], axis=1)
output_data
You could filter the Sale columns, compare each against its column mean, and sum along the rows:
filtered = input_data.filter(like='Sale')
input_data['Count_Sale_Average'] = filtered.gt(filtered.mean()).sum(axis=1)
Output:
Toy Color SaleA SaleB SaleC Count_Sale_Average
0 A Red 1 1 5 1
1 B Green 2 3 2 2
2 C Orange 0 4 3 1
3 D Blue 1 2 5 1
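To make the comparison step explicit, here is a small sketch of the intermediates (values shown in the comments are for the example input_data):

means = filtered.mean()    # per-column means: SaleA 1.0, SaleB 2.5, SaleC 3.75
mask = filtered.gt(means)  # boolean frame; each cell is compared to its own column's mean
mask.sum(axis=1)           # row-wise count of True -> 1, 2, 1, 1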
I would like to make a new column with the order of the numbers in a list. I get 3,1,0,4,2,5 (the indices of the numbers from lowest to highest), but I would like a new column with 2,1,4,0,3,5, so that for each row I can see where that number falls in the ordering of the whole list. What am I doing wrong?
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
list order
0 4 2
1 3 1
2 6 4
3 1 0
4 5 3
5 9 5
You can use the method parameter to control how to resolve ties.
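For example, a small sketch with a hypothetical series containing a tie:

import pandas as pd

s = pd.Series([4, 3, 4, 1, 5])
print(s.rank(method='average'))  # default: tied 4s share 3.5 -> 3.5, 2.0, 3.5, 1.0, 5.0
print(s.rank(method='min'))      # ties take the lowest rank  -> 3.0, 2.0, 3.0, 1.0, 5.0
print(s.rank(method='dense'))    # no gaps after a tie group  -> 3.0, 2.0, 3.0, 1.0, 4.0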
What I have?
I have a dataframe like this:
id value
0 0 5
1 0 5
2 0 6
3 1 7
4 1 7
What I want to get?
I want to drop all rows whose id has more than one distinct value. In the example above, I want to drop all the rows with id = 0:
id value
3 1 7
4 1 7
What I have tried?
import pandas as pd
df = pd.DataFrame({'id':[0, 0, 0, 1, 1], 'value':[5,5,6,7,7]})
print(df)
id_list = df['id'].tolist()
id_set = set(id_list)
for id in id_set:
    temp_list = df.loc[df['id'] == id, 'value'].tolist()
    s = set(temp_list)
    if len(s) > 1:
        df = df.loc[df['id'] != id]
It works, but it's ugly and inefficient.
Is there a better, more Pythonic way using pandas methods?
Use GroupBy.transform with DataFrameGroupBy.nunique to get the number of unique values per group as a Series aligned with the original rows, then compare and filter with boolean indexing:
df = df[df.groupby('id')['value'].transform('nunique').eq(1)]
print(df)
id value
3 1 7
4 1 7
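An alternative sketch using GroupBy.filter, which keeps only the groups satisfying a predicate; it is usually slower than transform on large frames, but arguably more readable:

df = df.groupby('id').filter(lambda g: g['value'].nunique() == 1)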
Try this code:
import pandas as pd

id1 = pd.Series([0,0,0,1,1])
value = pd.Series([5,5,6,7,7])
data = pd.DataFrame({'id': id1, 'value': value})
datag = data.groupby('id')

# collect the indices of rows whose id has more than one distinct value, then drop them
datadel = []
for i in set(data.id):
    if len(set(datag.get_group(i)['value'])) != 1:
        datadel.extend(data.loc[data["id"] == i].index.tolist())
data.drop(datadel, inplace=True)
print(data)
This is my pandas DataFrame with original column names.
old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
            1           3               0       0
            2           1               1       5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column for each unique cm. In this example there should be 2 new columns.
Finally, each new column should store the row-wise count of non-zero values across the original columns containing that cm token, i.e.
old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
            1           3               0       0    2    0
            2           1               1       5    2    1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns
How do I proceed with steps 2 and 3, so that the solution is automatic? (I don't want to define the column names cm1 and cm2 manually, because the original data set might have many cm variations.)
You can use:
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
0              1           3               0       0
1              2           1               1       5
First, filter the columns whose names contain the string cm, so that columns without cm are removed:
df1 = df.filter(regex='cm')
Now rename the columns to their cm tokens (cm1, cm2, ...):
print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
   cm1  cm1  cm2
0    1    3    0
1    2    1    1
Now count the non-zero values: convert df1 to a boolean DataFrame and sum, since True converts to 1 and False to 0. You need the count per unique column name, so group by the column names and sum the values.
df1 = df1.astype(bool)
print(df1)
    cm1   cm1    cm2
0  True  True  False
1  True  True   True
print(df1.groupby(df1.columns, axis=1).sum())
   cm1  cm2
0    2    0
1    2    1
You need the unique column names, which are then added to the original df:
print(df1.columns.unique())
['cm1' 'cm2']
Last, add the new columns (here df[['cm1','cm2']]) from the groupby result:
df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
0              1           3               0       0    2    0
1              2           1               1       5    2    1
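Note that groupby(..., axis=1) is deprecated in recent pandas versions. A hedged equivalent sketch: transpose, group on the index, and transpose back.

# df1 here is the boolean frame built above
counts = df1.T.groupby(level=0).sum().T
df[counts.columns] = counts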
Once you know which columns have cm in them, you can map them (with a dict) to the desired new column, with an adapted version of this answer:
col_map = {c: 'cm' + c[c.index('cm') + len('cm')] for c in ind}
# ^ if you are hard-coding this in, you might as well use 2 instead of len('cm')
so that each column maps to 'cm' plus the character that directly follows it; in this case the mapping would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col, new_col in col_map.items():
    if new_col not in df:
        df[new_col] = [int(a != 0) for a in df[col]]
    else:
        df[new_col] += [int(a != 0) for a in df[col]]
Note that int(a != 0) simply gives 0 if the value is 0 and 1 otherwise. The only issue is that dicts are unordered in older Python versions (before 3.7), so it may be preferable to add the new columns in order according to the values (like the answer here):
import operator

for col, new_col in sorted(col_map.items(), key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col] += [int(a != 0) for a in df[col]]
    else:
        df[new_col] = [int(a != 0) for a in df[col]]
to ensure the new columns are inserted in order.
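A vectorized sketch of the same idea, assuming col_map from above: build the 0/1 indicator frame once, then sum the source columns that share a target name.

# Indicator frame: 1 where the original cm columns are non-zero, else 0
indicator = (df[list(col_map)] != 0).astype(int)
for new_col in sorted(set(col_map.values())):
    source_cols = [c for c, n in col_map.items() if n == new_col]
    df[new_col] = indicator[source_cols].sum(axis=1)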