I am really struggling to make this work.
How can I take a Series, turn it into a DataFrame, add a column to it, and concatenate the result inside a loop?
The pseudocode is below, but the correct syntax is a mystery to me:
def func_B_Column(df):
    return 1
df_1 = (...) # columns=['a', 'etc1', 'etc2']
df_2 = pandas.DataFrame(columns=['a','b','c'])
listOfColumnC = ['c1','c2','c3']
for var in listOfColumnC:
    series = df_1.groupby('a').apply(func_B_Column)  # series now has 'a' as index and func_B_Column's result as values
    aux = series.to_frame('b')
    aux['c'] = var  # add another column 'c' holding the loop value
    df_2 = df_2.append(aux)  # concatenate the results as rows at the end
Edited after the question's refinement
df_2 = pd.DataFrame()
for var in listOfColumnC:
    df_2 = df_2.append(pd.DataFrame({'b': df_1.groupby('a').apply(func_B_Column), 'c': var}))
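Note that `DataFrame.append` was removed in pandas 2.0, so on current versions the same loop needs `pd.concat` instead. A minimal runnable sketch, where the contents of `df_1` are made-up sample data and `func_B_Column` is the stub from the question:

```python
import pandas as pd

def func_B_Column(df):
    return 1

df_1 = pd.DataFrame({'a': ['x', 'x', 'y'], 'etc1': [1, 2, 3], 'etc2': [4, 5, 6]})
listOfColumnC = ['c1', 'c2', 'c3']

# Build one frame per value of 'c', then concatenate once at the end
frames = []
for var in listOfColumnC:
    aux = df_1.groupby('a').apply(func_B_Column).to_frame('b')
    aux['c'] = var
    frames.append(aux)
df_2 = pd.concat(frames)
```

Collecting the pieces in a list and concatenating once at the end is also much faster than appending inside the loop, since each append/concat copies the whole frame.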
Related
I have a pandas DataFrame 'df' with x rows and another pandas DataFrame 'df2' with y rows
(x < y). I want to return the indexes where the values of df['Field'] equal the values of df2['Field'], in order to add the respective 'Manager' to df.
The code I have is as follows:
data2 = [['field1', 'Paul G'] , ['field2', 'Mark R'], ['field3', 'Roy Jr']]
data = [['field1'] , ['field2']]
columns = ['Field']
columns2 = ['Field', 'Manager']
df = pd.DataFrame(data, columns=columns)
df2 = pd.DataFrame(data2, columns=columns2)
fieldNames = df['Field']
exists = fieldNames.reset_index(drop=True) == df2['Field'].reset_index(drop=True)
This returns the error message:
ValueError: Can only compare identically-labeled Series objects
Does anyone know how to fix this?
As @NickODell mentioned, you could use a merge (essentially a left join). See the code below.
df_new = pd.merge(df, df2, on='Field', how='left')
print(df_new)
Output:
Field Manager
0 field1 Paul G
1 field2 Mark R
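If you also want the row indexes where the match occurs, as the question originally asked, `Series.isin` gives a boolean mask without the identically-labeled restriction that broke the `==` comparison (a sketch using the question's sample data):

```python
import pandas as pd

df = pd.DataFrame({'Field': ['field1', 'field2']})
df2 = pd.DataFrame({'Field': ['field1', 'field2', 'field3'],
                    'Manager': ['Paul G', 'Mark R', 'Roy Jr']})

# True for each row of df whose 'Field' value appears anywhere in df2['Field'];
# unlike ==, isin does not require the two Series to share an index or length
mask = df['Field'].isin(df2['Field'])
matching_indexes = df.index[mask].tolist()
```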
With this code:
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    'df{}'.format(str(i)) = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
I get this error:
'df{}'.format(str(i)) = pd.read_excel('test.xlsx',sheet_name=snlist,
skiprows=range(6))
^ SyntaxError: cannot assign to function call
I can't understand the error or how to solve it. What's the problem?
df+str(i) also returns an error.
I want the result to be:
df1 = pd.read_excel.. list1...
df2 = pd.read_excel... list2....
You can't assign the result of pd.read_excel to 'df{}'.format(str(i)) -- which is a string that looks like "df1", "df2", etc. That is why you get this error message. The message is probably confusing, since Python treats the left-hand side as a function call being assigned to.
It seems like you want a list or a dictionary of DataFrames instead.
To do this, assign the result of pd.read_excel to a variable, e.g. df, and then append it to a list or add it to a dictionary of DataFrames.
As a list:
dataframes = []
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in zip(range(1, 13), sn):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes.append(df)
As a dictionary:
dataframes = {}
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in zip(range(1, 13), sn):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes[i] = df
In both cases, you can access the DataFrames by indexing like this:
for i in range(len(dataframes)):
    print(dataframes[i])
    # Note: indexes start at 0 here instead of 1, so you may want
    # to change your `range` above to start at 0
Or more simply:
for df in dataframes:
    print(df)
In the case of the dictionary, you'd probably want:
for i, df in dataframes.items():
    print(i, df)  # `i` is the key and `df` is the actual DataFrame
If you really do want df1, df2 etc as the keys, then do this instead:
dataframes[f'df{i}'] = df
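As an aside, pandas can build the dictionary for you: passing `sheet_name=None` to `read_excel` returns a dict of DataFrames keyed by sheet name. A sketch that first writes a small two-sheet workbook so the example is self-contained (reading and writing .xlsx requires the openpyxl package):

```python
import pandas as pd

# Create a tiny two-sheet workbook just for the demo
with pd.ExcelWriter('test.xlsx') as writer:
    pd.DataFrame({'a': [1, 2]}).to_excel(writer, sheet_name='Sheet1', index=False)
    pd.DataFrame({'b': [3, 4]}).to_excel(writer, sheet_name='Sheet2', index=False)

# sheet_name=None loads every sheet in one call: {sheet_name: DataFrame}
dataframes = pd.read_excel('test.xlsx', sheet_name=None)
```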
I have dataframe as below:
df = pd.DataFrame({'$a':[1,2], '$b': [10,20]})
I tried creating a function which allows me to change a column name dynamically, where I just input the old and new column names, as below:
def rename_column_name(df, old_column, new_column):
    df = df.rename({old_column: new_column}, axis=1)
    return df
This function only works for a single column at a time, as below:
new_df = rename_column_name(df, '$a' , 'a')
which give me this new_df as below:
new_df = pd.DataFrame({'a':[1,2], '$b': [10,20]})
However, I wanted to create a function that allows me to change one or multiple columns, depending on my preference, as such:
new_df = rename_column_name(df, ['$a','$b'] , ['a','b'])
And get the new_df as below
new_df = pd.DataFrame({'a':[1,2], 'b': [10,20]})
So, how do I make my function more dynamic to allow me the freedom to enter multiple/one column names and rename them?
You don't need a function; you can do this by building a mapping with dict and zip:
In [265]: old_names = df.columns.tolist()
In [266]: new_names = ['a','b']
In [268]: df = df.rename(columns=dict(zip(old_names, new_names)))
In [269]: df
Out[269]:
a b
0 1 10
1 2 20
Function that OP needs:
In [274]: def rename_column_name(df, old_column_list, new_column_list):
     ...:     df = df.rename(columns=dict(zip(old_column_list, new_column_list)))
     ...:     return df
     ...:
In [275]: rename_column_name(df,old_names,new_names)
Out[275]:
a b
0 1 10
1 2 20
You need to pass a list of columns to this function. It can be multiple columns or a single column. This should do what you were looking for.
def rename_column_name(df, old_column, new_column):
    if not isinstance(old_column, (list, tuple)):
        old_column = [old_column]
    if not isinstance(new_column, (list, tuple)):
        new_column = [new_column]
    df = df.rename({old: new for old, new in zip(old_column, new_column)}, axis=1)
    return df  # dang, I should have used dict(zip(...)) like in the other solution :P
I guess... although I don't understand how this is easier than just calling
df.rename(columns={'$a': 'a', '$b': 'b'})
You can do that with the zip function, where
old_column_names and new_column_names should be lists.
def rename_column_name(df, old_column_names, new_column_names):
    # validate that a new name has been passed for every old name
    if len(old_column_names) == len(new_column_names):
        df = df.rename(columns=dict(zip(old_column_names, new_column_names)))
    return df
To handle both a single-column rename and lists, the function needs further conditions, which can be:
def rename_column_name(df, old_column_names, new_column_names):
    # validate that a new name has been passed for every old name
    if isinstance(old_column_names, list) and isinstance(new_column_names, list):
        if len(old_column_names) == len(new_column_names):
            df = df.rename(columns=dict(zip(old_column_names, new_column_names)))
    elif isinstance(old_column_names, str) and isinstance(new_column_names, str):
        df = df.rename(columns={old_column_names: new_column_names})
    return df
I have a pandas dataframe with two columns: 'action_date', holding a single date, and 'verification_date', holding a list of dates. I am trying to calculate the difference between the date in 'action_date' and each of the dates in the corresponding 'verification_date' list, and then fill two new columns with the number of dates in 'verification_date' whose difference is over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.Grouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df
make_columns(df)
This kinda works EXCEPT the df has the same values for each row, which is not true as the dates are different. For example, in the first row of the dataframe, there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem is that these lines always overwrite the whole column with the value from the last iteration:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want instead is to set the value for row i only, which you can do by replacing the lines above with:
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)
This sets a scalar value at row i in the over_360 or under_360 column. (Older pandas versions offered set_value and .ix for this, but both have since been removed; df.loc[i, 'over_360'] works the same way as .at here.)
You might want to try this:
df['over_360'] = df.apply(lambda x: sum((x['action_date'] - i).days > 360 for i in x['verification_date']), axis=1)
df['under_360'] = df.apply(lambda x: sum((x['action_date'] - i).days < 360 for i in x['verification_date']), axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.
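Putting the two apply lines together in a self-contained sketch (the dates here are made-up sample data: in the first row both verification dates are more than 360 days before the action date, in the second the single date is only 2 days before):

```python
import pandas as pd

df = pd.DataFrame({'action_date': pd.to_datetime(['2017-01-01', '2017-01-03'])})
df['verification_date'] = [
    list(pd.to_datetime(['2016-01-01', '2015-01-08'])),  # both > 360 days earlier
    list(pd.to_datetime(['2017-01-01'])),                # only 2 days earlier
]

# Count, per row, how many dates in the list are over/under 360 days old
df['over_360'] = df.apply(
    lambda x: sum((x['action_date'] - i).days > 360 for i in x['verification_date']),
    axis=1)
df['under_360'] = df.apply(
    lambda x: sum((x['action_date'] - i).days < 360 for i in x['verification_date']),
    axis=1)
```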
I have 2 DataFrames indexed by Time.
import datetime as dt
import pandas as pd
rng1 = pd.date_range("11:00:00", "11:00:30", freq="500ms")
df1 = pd.DataFrame({'A': range(1, 62), 'B': range(1000, 62000, 1000)}, index=rng1)
rng2 = pd.date_range("11:00:03", "11:01:03", freq="700ms")
df2 = pd.DataFrame({'Z': range(10, 870, 10)}, index=rng2)
I am trying to assign 'C' in df1 the last element of 'Z' in df2 closest to time index of df1. The following code seems to work now (returns a list).
df1['C'] = None
for tidx, a, b, c in df1.itertuples():
    df1['C'].loc[tidx] = df2[:tidx].tail(1).Z.values
    # df1['C'].loc[tidx] = df2[:tidx].Z  --> was trying this, which didn't work
df1
Is it possible to avoid iterating?
TIL: pandas Index instances have a map method.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.map.html
def fn(df):
    def inner(dt):
        # position of the row in df whose timestamp is closest to dt
        return df['Z'].iloc[abs(df.index - dt).argmin()]
    return inner
df1['C'] = df1.index.map(fn(df2))
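Since the goal is "for each timestamp in df1, take the last value of Z at or before it", `pd.merge_asof` is arguably the idiomatic vectorized tool. A sketch on smaller toy frames, not the question's exact data:

```python
import pandas as pd

rng1 = pd.date_range('11:00:00', '11:00:05', freq='1s')
df1 = pd.DataFrame({'A': range(len(rng1))}, index=rng1)

rng2 = pd.date_range('11:00:01', '11:00:04', freq='1500ms')
df2 = pd.DataFrame({'Z': [10, 20, 30]}, index=rng2)

# For each df1 timestamp, pick the most recent df2 row at or before it;
# timestamps earlier than every df2 entry get NaN
out = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                    direction='backward')
```

Note that `merge_asof` with `direction='backward'` never looks forward in time, which matches the "last element of Z" wording in the question; pass `direction='nearest'` instead to reproduce the closest-timestamp behaviour of the `argmin` approach above.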