I'm trying to sort my DataFrame using sort_values, but I'm not getting the desired output.
df1 = pd.read_csv('raw data/120_FT DDMG.csv')
df2 = pd.read_csv('raw data/120_FT MG.csv')
df3 = pd.read_csv('raw data/120_FT DD.csv')
dconcat = pd.concat([df1,df2,df3])
dconcat['date'] = pd.to_datetime(dconcat['ActivityDates(Individual)']+' '+dconcat['ScheduledStartTime'])
dconcat.sort_values(by='date')
dconcat = dconcat.set_index('date')
print(dconcat)
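sort_values returns a new, sorted DataFrame by default (inplace=False) and leaves the original unchanged, so you need to assign the result back:
dconcat = dconcat.sort_values(by='date')
Or sort in place:
dconcat.sort_values(by='date', inplace=True)
Alternatively, set the date column as the index and sort by the index: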
dconcat = pd.concat([df1,df2,df3])
dconcat['date'] = pd.to_datetime(dconcat['ActivityDates(Individual)']+' '+dconcat['ScheduledStartTime'])
dconcat.set_index('date', inplace=True)
dconcat.sort_index(inplace=True)
print(dconcat)
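To make the difference visible, here is a minimal sketch on a small made-up frame (the values are illustrative and do not come from the CSV files in the question). It shows that calling sort_values without keeping the result leaves the frame untouched:

```python
import pandas as pd

# Hypothetical stand-in for the concatenated CSV data.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-03-02 09:00", "2021-03-01 14:30", "2021-03-01 08:15"]
    ),
    "value": [3, 2, 1],
})

# The sorted copy is discarded here, so df keeps its original order.
df.sort_values(by="date")
unsorted_values = df["value"].tolist()

# Assigning the result keeps the sorted frame.
df_sorted = df.sort_values(by="date")
sorted_values = df_sorted["value"].tolist()

print(unsorted_values)  # [3, 2, 1]
print(sorted_values)    # [1, 2, 3]
```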
I'm trying to achieve this kind of transformation with Pandas.
I made this code but unfortunately it doesn't give the result I'm searching for.
CODE :
import pandas as pd
df = pd.read_csv('file.csv', delimiter=';')
df = df.count().reset_index().T.reset_index()
df.columns = df.iloc[0]
df = df[1:]
df
RESULT :
Do you have any suggestions? Any help would be appreciated.
First create Boolean columns that mark the nonOK tests, then use named aggregation: size to count the IDs, sum for the Values column, and sum again to count the True values. Finally, add the two test counts together:
df = (df.assign(NumberOfTest1 = df['Test one'].eq('nonOK'),
NumberOfTest2 = df['Test two'].eq('nonOK'))
.groupby('Category', as_index=False)
.agg(NumberOfID = ('ID','size'),
Values = ('Values','sum'),
NumberOfTest1 = ('NumberOfTest1','sum'),
NumberOfTest2 = ('NumberOfTest2','sum'))
.assign(TotalTest = lambda x: x['NumberOfTest1'] + x['NumberOfTest2']))
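Since the input file is not shown, here is the same approach run on a small invented table, so the column names and values below are assumptions based only on the names used in the answer:

```python
import pandas as pd

# Hypothetical input; the real file.csv is not shown in the question.
df = pd.DataFrame({
    "Category": ["A", "A", "B"],
    "ID": [1, 2, 3],
    "Values": [10, 20, 5],
    "Test one": ["OK", "nonOK", "nonOK"],
    "Test two": ["nonOK", "nonOK", "OK"],
})

out = (
    # Boolean helper columns: True where the test is nonOK.
    df.assign(
        NumberOfTest1=df["Test one"].eq("nonOK"),
        NumberOfTest2=df["Test two"].eq("nonOK"),
    )
    .groupby("Category", as_index=False)
    .agg(
        NumberOfID=("ID", "size"),               # rows per category
        Values=("Values", "sum"),                # total of Values
        NumberOfTest1=("NumberOfTest1", "sum"),  # count of True values
        NumberOfTest2=("NumberOfTest2", "sum"),
    )
    .assign(TotalTest=lambda x: x["NumberOfTest1"] + x["NumberOfTest2"])
)
print(out)
```

Summing a Boolean column counts its True values, which is why sum is used for the two test columns.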
I'm trying to resample the data, however, it does not seem to be working properly. I want to have start-of-month data to start-of-month.
The code is the following
df = pd.read_csv('OSEBX_daily.csv')
df = df[['time', 'OSEBX GR']]
df['time'] = pd.to_datetime(df['time']).dt.normalize()
df.set_index('time', inplace=True)
df.index = pd.to_datetime(df.index)
df.resample('1M').mean()
df['returns'] = df['OSEBX GR'].pct_change()
plt.plot(df['returns'])
You forgot to assign the result back:
df = df.resample('1M').mean()
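If "start-of-month" means that each monthly value should be labelled with the first day of the month, the 'MS' (month start) alias may be closer to what you want. A minimal sketch on an invented daily series (the data below is a placeholder for OSEBX_daily.csv, which is not shown):

```python
import pandas as pd

# Hypothetical daily series standing in for OSEBX_daily.csv.
idx = pd.date_range("2021-01-01", "2021-03-31", freq="D")
df = pd.DataFrame({"OSEBX GR": range(len(idx))}, index=idx)

# 'MS' labels each bin with the first day of the month.
# The result is a new frame, so it has to be assigned.
monthly_mean = df.resample("MS").mean()

# If the first observation of each month is wanted instead of the mean:
monthly_first = df.resample("MS").first()

# Returns computed on the resampled data, not on the daily data.
monthly_mean["returns"] = monthly_mean["OSEBX GR"].pct_change()
print(monthly_mean)
```

Whether mean() or first() is right depends on what the monthly figure is supposed to represent, which the question does not say.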
I have the following dataframe:
data = {'Names':['Abbey','English','Maths','Billy','English','Maths','Charlie','English','Maths'],'Subject Grade':['Student Name',85,91,'Student Name',82,74,'Student Name',83,96]}
df = pd.DataFrame(data, columns = ['Names','Subject Grade'])
I would like to reformat the dataframe in order for the names, subject and grades to all be in their respective columns as follows:
data2 = {'Names':['Abbey','Abbey','Billy','Billy','Charlie','Charlie'],'Subject':['English','Maths','English','Maths','English','Maths'],'Grade':[85,91,82,74,83,96]}
df2 = pd.DataFrame(data2, columns = ['Names','Subject','Grade'])
You can use the following steps:
df['name'] = df['Names'].mask(df['Subject Grade'] != "Student Name")
df['name'] = df['name'].ffill()
df = df.query('`Subject Grade`!="Student Name"')
df = df.rename(columns={'Names':'Subject', 'Subject Grade':'Grade', 'name':'Names'})
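For reference, here is the same idea as one self-contained script using the data from the question. The helper name is_header and the final column reordering are my additions, not part of the answer above:

```python
import pandas as pd

data = {
    "Names": ["Abbey", "English", "Maths", "Billy", "English", "Maths",
              "Charlie", "English", "Maths"],
    "Subject Grade": ["Student Name", 85, 91, "Student Name", 82, 74,
                      "Student Name", 83, 96],
}
df = pd.DataFrame(data)

# Rows whose second column says "Student Name" carry the student's name.
is_header = df["Subject Grade"].eq("Student Name")

# Keep the name only on header rows, then carry it down to the rows below.
df["name"] = df["Names"].where(is_header).ffill()

out = (
    df[~is_header]  # drop the header rows
    .rename(columns={"Names": "Subject", "Subject Grade": "Grade", "name": "Names"})
    .loc[:, ["Names", "Subject", "Grade"]]
    .reset_index(drop=True)
)
out["Grade"] = out["Grade"].astype(int)
print(out)
```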
I have the following dataframe as below:
df = pd.DataFrame({'Field':'FAPERF',
'Form':'LIVERID',
'Folder':'ALL',
'Logline':'9',
'Data':'Yes',
'Data':'Blank',
'Data':'No',
'Logline':'10'})
I need dataframe:
df = pd.DataFrame({'Field':['FAPERF','FAPERF'],
'Form':['LIVERID','LIVERID'],
'Folder':['ALL','ALL'],
'Logline':['9','10'],
'Data':['Yes','Blank','No']})
I tried the code below but was not able to get the desired output.
res3.set_index(res3.groupby(level=0).cumcount(), append=True)['Data'].unstack(0)
Can anyone please help?
I believe your best option is to create several DataFrames that share the same column names (for example, three frames that each have a "Data" column) and then concatenate them:
df1 = pd.DataFrame({'Field': ['FAPERF'],
                    'Form': ['LIVERID'],
                    'Folder': ['ALL'],
                    'Logline': ['9'],
                    'Data': ['Yes']})
# Values are wrapped in lists: a dict of bare scalars needs an explicit index.
df2 = pd.DataFrame({'Data': ['No'],
                    'Logline': ['10']})
df3 = pd.DataFrame({'Data': ['Blank']})
frames = [df1, df2, df3]
# Columns missing from a frame are filled with NaN in the result.
result = pd.concat(frames)
You just need two lists that specify the Data value and the Logline for each row; then build one single-row DataFrame per entry and concatenate them.
import pandas as pd
import numpy as np
list_df = []
data_type_list = ["yes","no","Blank"]
logline_type = ["9","10",'10']
for x in range(len(data_type_list)):
    new_dict = {'Field': ['FAPERF'], 'Form': ['LIVERID'], 'Folder': ['ALL'], "Data": [data_type_list[x]], "Logline": [logline_type[x]]}
    df = pd.DataFrame(new_dict)
    list_df.append(df)
new_df = pd.concat(list_df)
print(new_df)
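As a variation, the rows can be collected as dictionaries and turned into a single DataFrame in one call, which avoids concatenating many one-row frames. This is only a sketch: the pairing of Logline and Data values below mirrors the lists in the answer above and is an assumption about the intended result.

```python
import pandas as pd

# Values shared by every row.
shared = {"Field": "FAPERF", "Form": "LIVERID", "Folder": "ALL"}

# (Logline, Data) pairs; this pairing is assumed, not given in the question.
pairs = [("9", "yes"), ("10", "no"), ("10", "Blank")]

# One dict per row, built from the shared values plus the pair.
rows = [{**shared, "Logline": logline, "Data": data} for logline, data in pairs]

new_df = pd.DataFrame(rows)
print(new_df)
```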
I have two Parquet files, which I load with spark.read. Both DataFrames have a column named key, so I join them with:
df = df.join(df2, on=['key'], how='inner')
The columns of df are ["key", "Duration", "Distance"] and those of df2 are ["key", "department id"]. In the end I want to print Duration, max(Distance) and department id, grouped by department id. What I have done so far is:
df.join(df.groupBy('departmentid').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
But I think it is too slow. Is there a faster way to achieve my goal?
Thanks in advance.
EDIT: sample (first 2 lines of each file)
df:
369367789289,2015-03-27 18:29:39,2015-03-27 19:08:28,-73.975051879882813,40.760562896728516,-73.847900390625,40.732685089111328,34.8
369367789290,2015-03-27 18:29:40,2015-03-27 18:38:35,-73.988876342773438,40.77423095703125,-73.985160827636719,40.763439178466797,11.16
df1:
369367789289,1
369367789290,2
The columns are separated by ",". The first column in both files is my key; in the first file it is followed by timestamps, longitudes and latitudes. The second file contains only the key and the department id.
To create Distance I am using a function called formater. This is how I get my distance and duration:
df = df.filter("_c3!=0 and _c4!=0 and _c5!=0 and _c6!=0")
df = df.withColumn("_c0", df["_c0"].cast(LongType()))
df = df.withColumn("_c1", df["_c1"].cast(TimestampType()))
df = df.withColumn("_c2", df["_c2"].cast(TimestampType()))
df = df.withColumn("_c3", df["_c3"].cast(DoubleType()))
df = df.withColumn("_c4", df["_c4"].cast(DoubleType()))
df = df.withColumn("_c5", df["_c5"].cast(DoubleType()))
df = df.withColumn("_c6", df["_c6"].cast(DoubleType()))
df = df.withColumn('Distance', formater(df._c3,df._c5,df._c4,df._c6))
df = df.withColumn('Duration', F.unix_timestamp(df._c2) -F.unix_timestamp(df._c1))
And then, as I showed above:
df = df.join(vendors, on=['key'], how='inner')
df.registerTempTable("taxi")
df.join(df.groupBy('vendor').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
The output must be:
Distance Duration department id
grouped by department id, keeping only the row with max(Distance).