How to aggregate a dataframe then transpose it with Pandas - python

I'm trying to achieve this kind of transformation with Pandas.
I made this code but unfortunately it doesn't give the result I'm searching for.
CODE :
import pandas as pd
df = pd.read_csv('file.csv', delimiter=';')
df = df.count().reset_index().T.reset_index()
df.columns = df.iloc[0]
df = df[1:]
df
RESULT :
Do you have any proposition ? Any help will be appreciated.

First create columns for test nonOK and then use named aggregatoin for count, sum column Values and for count Trues values use sum again, last sum both columns:
df = (df.assign(NumberOfTest1 = df['Test one'].eq('nonOK'),
NumberOfTest2 = df['Test two'].eq('nonOK'))
.groupby('Category', as_index=False)
.agg(NumberOfID = ('ID','size'),
Values = ('Values','sum'),
NumberOfTest1 = ('NumberOfTest1','sum'),
NumberOfTest2 = ('NumberOfTest2','sum'))
.assign(TotalTest = lambda x: x['NumberOfTest1'] + x['NumberOfTest2']))

Related

dataframe if string contains substring divide by 1000

I have this dataframe looking like the below dataframe.
import pandas as pd
data = [['yellow', '800test' ], ['red','900ui'], ['blue','900test'], ['indigo','700test'], ['black','400ui']]
df = pd.DataFrame(data, columns = ['Donor', 'value'])
In the value field, if a string contains say 'test', I'd like to divide these numbers by 1000. What would be the best way to do this?
Check Below code using lambda function
df['value_2'] = df.apply(lambda x: str(int(x.value.replace('test',''))/1000)+'test' if x.value.find('test') > -1 else x.value, axis=1)
df
Output:
df["value\d"] = df.value.str.findall("\d+").str[0].astype(int)
df["value\w"] = df.value.str.findall("[^\d]+").str[0]
df.loc[df["value\w"] == "test", "value\d"] = df["value\d"]/1000
df["value"] = df["value\w"] + df["value\d"].astype(str)

Count occurrence of column values in other dataframe column

I have two dataframes and I want to count the occurrence of "classifier" in "fullname". My problem is that my script counts a word like "carrepair" only for one classifier and I would like to have a count for both classifiers. I would also like to add one random coordinate that matches the classifier.
First dataframe:
Second dataframe:
Result so far:
Desired Result:
My script so far:
import pandas as pd
fl = pd.read_excel (r'fullname.xlsx')
clas= pd.read_excel (r'classifier.xlsx')
fl.fullname= fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()
pat = '({})'.format('|'.join(clas['classifier'].unique()))
fl['fullname'] = fl['fullname'].str.extract(pat, expand = False)
clas['count_of_classifier'] = clas['classifier'].map(fl['fullname'].value_counts())
print(clas)
Thanks!
You could try this:
import pandas as pd
fl = pd.read_excel (r'fullname.xlsx')
clas= pd.read_excel (r'classifier.xlsx')
fl.fullname= fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()
# Add a new column to 'fl' containing either 'repair' or 'car'
for value in clas["classifier"].values:
fl.loc[fl["fullname"].str.contains(value, case=False), value] = value
# Count values and create a new dataframe
new_clas = pd.DataFrame(
{
"classifier": [col for col in clas["classifier"].values],
"count": [fl[col].count() for col in clas["classifier"].values],
}
)
# Merge 'fl' and 'new_clas'
new_clas = pd.merge(
left=new_clas, right=fl, how="left", left_on="classifier", right_on="fullname"
).reset_index(drop=True)
# Keep only expected columns
new_clas = new_clas.reindex(columns=["classifier", "count", "coordinate"])
print(new_clas)
# Outputs
classifier count coordinate
repair 3 52.520008, 13.404954
car 3 54.520008, 15.404954

How to merge multiple columns with same names in a dataframe

I have the following dataframe as below:
df = pd.DataFrame({'Field':'FAPERF',
'Form':'LIVERID',
'Folder':'ALL',
'Logline':'9',
'Data':'Yes',
'Data':'Blank',
'Data':'No',
'Logline':'10'}) '''
I need dataframe:
df = pd.DataFrame({'Field':['FAPERF','FAPERF'],
'Form':['LIVERID','LIVERID'],
'Folder':['ALL','ALL'],
'Logline':['9','10'],
'Data':['Yes','Blank','No']}) '''
I had tried using the below code but not able to achieve desired output.
res3.set_index(res3.groupby(level=0).cumcount(), append=True['Data'].unstack(0)
Can anyone please help me.
I believe your best option is to create multiple data frames with the same column name ( example 3 df with column name : "Data" ) then simply perform a concat function over Data frames :
df1 = pd.DataFrame({'Field':'FAPERF',
'Form':'LIVERID',
'Folder':'ALL',
'Logline':'9',
'Data':'Yes'}
df2 = pd.DataFrame({
'Data':'No',
'Logline':'10'})
df3 = pd.DataFrame({'Data':'Blank'})
frames = [df1, df2, df3]
result = pd.concat(frames)
You just need to add to list in which you specify the logline and data_type for each row.
import pandas as pd
import numpy as np
list_df = []
data_type_list = ["yes","no","Blank"]
logline_type = ["9","10",'10']
for x in range (len(data_type_list)):
new_dict = { 'Field':['FAPERF'], 'Form':['LIVERID'],'Folder':['ALL'],"Data" : [data_type_list[x]], "Logline" : [logline_type[x]]}
df = pd.DataFrame(new_dict)
list_df.append(df)
new_df = pd.concat(list_df)
print(new_df)

Sort a Pandas DataFrame using both Date and Time

I'm Trying to sort my dataframe using "sort_value" Im not getting the desired output
df1 = pd.read_csv('raw data/120_FT DDMG.csv')
df2 = pd.read_csv('raw data/120_FT MG.csv')
df3 = pd.read_csv('raw data/120_FT DD.csv')
dconcat = pd.concat([df1,df2,df3])
dconcat['date'] = pd.to_datetime(dconcat['ActivityDates(Individual)']+' '+dconcat['ScheduledStartTime'])
dconcat.sort_values(by='date')
dconcat = dconcat.set_index('date')
print(dconcat)
sort_values returns a data frame which is sorted if inplace=False.
so dconcat=dconcat.sort_values(by='date')
or you can do dconcat.sort_values(by='date', inplace=True)
you can try this;
dconcat = pd.concat([df1,df2,df3])
dconcat['date'] = pd.to_datetime(dconcat['ActivityDates(Individual)']+' '+dconcat['ScheduledStartTime'])
dconcat.set_index('date', inplace=True)
dconcat.sort_index(inplace=True)
print(dconcat)

When using a pandas dataframe, how do I add column if does not exist?

I'm new to using pandas and am writing a script where I read in a dataframe and then do some computation on some of the columns.
Sometimes I will have the column called "Met":
df = pd.read_csv(File,
sep='\t',
compression='gzip',
header=0,
names=["Chrom", "Site", "coverage", "Met"]
)
Other times I will have:
df = pd.read_csv(File,
sep='\t',
compression='gzip',
header=0,
names=["Chrom", "Site", "coverage", "freqC"]
)
I need to do some computation with the "Met" column so if it isn't present I will need to calculate it using:
df['Met'] = df['freqC'] * df['coverage']
is there a way to check if the "Met" column is present in the dataframe, and if not add it?
You check it like this:
if 'Met' not in df:
df['Met'] = df['freqC'] * df['coverage']
When interested in conditionally adding columns in a method chain, consider using pipe() with a lambda:
df.pipe(lambda d: (
d.assign(Met=d['freqC'] * d['coverage'])
if 'Met' not in d else d
))
If you were creating the dataframe from scratch, you could create the missing columns without a loop merely by passing the column names into the pd.DataFrame() call:
cols = ['column 1','column 2','column 3','column 4','column 5']
df = pd.DataFrame(list_or_dict, index=['a',], columns=cols)
Alternatively you can use get:
df['Met'] = df.get('Met', df['freqC'] * df['coverage'])
If the column Met exists, the values inside this column are taken. Otherwise freqC and coverage are multiplied.

Categories