I have a DataFrame with various text and numeric columns which I use like a database. Since a column can be of dtype object, I can also store more complex objects inside a single cell, like a numpy array.
How could I store another DataFrame inside a cell?
df1=pd.DataFrame([1,'a'])
df2=pd.DataFrame([2,'b'])
This assignment fails:
df1.loc[0,0] = df2
ValueError: Incompatible indexer with DataFrame
PS. It is not a duplicate question as suggested below since I do not want to concatenate the "sub"-DataFrames
You can use set_value (deprecated since pandas 0.21.0 and removed in 1.0):
df1.set_value(0,0,df2)
or:
df1.iat[0,0]=df2
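For illustration, a minimal sketch of the .iat approach using the question's variables (note the target column must already be of object dtype, which it is here since it mixes an int and a string):

```python
import pandas as pd

df1 = pd.DataFrame([1, 'a'])   # column 0 has object dtype (mixed int/str)
df2 = pd.DataFrame([2, 'b'])

# .iat treats the assigned value as an opaque scalar, so the alignment
# check that made .loc raise ValueError is never triggered
df1.iat[0, 0] = df2

inner = df1.iat[0, 0]          # the nested DataFrame comes back intact
```

Retrieval works the same way through .iat, .iloc, or .loc.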
Since .set_value has been deprecated since version 0.21.0, you can instead convert df2 to a dict using to_dict and store that:
df1.loc[0,0] = [df2.to_dict()]
df1
Out[862]:
0
0 [{0: {0: 2, 1: 'b'}}]
1 a
If you need to convert it back to a DataFrame, you can use the DataFrame constructor:
pd.DataFrame(df1.loc[0,0][0])
Out[864]:
0
0 2
1 b
EMAP is a DataFrame, and I am using apply to perform some action on every row of EMAP.
The merge function raises a KeyError on the columns of the "row" argument.
But when I use the original DataFrame inside the function (commented in the code), I receive no error.
def merge(row):
    a = row[col_select_Event]
    # a = EMAP[col_select_Event][1:2]
    filtered_RCA = pd.merge(RCA, a, on=col_select_Event, how='inner')
    return a

j = EMAP.apply(merge, axis=1)
EMAP data frame is like this:

A           B  C
Apple       1  abc
Orange      2  abc
Strawberry  3  abc
RCA data frame is like this:

A       B
Apple   1
Orange  2
col_select_Event = ['A','B']
How do I resolve the error?
apply passes each row of the DataFrame to the function as a pandas Series.
So, the "row" argument is of pandas Series datatype.
EMAP[col_select_Event][1:2] ------> This is of type DataFrame and hence it works
whereas
row[col_select_Event] ---------> This is a pandas Series
You cannot merge a pandas Series with a pandas DataFrame. This is because, under apply, the columns of the DataFrame become the index of each row's Series.
To use the merge function, you must first convert the pandas Series back to a one-row pandas DataFrame:
row[col_select_Event].to_frame().T
The above code should work.
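A self-contained sketch of that fix using the question's data (merge_row is a stand-in name to avoid shadowing pd.merge; the infer_objects call is an assumption on my part, since the transpose leaves every column as object dtype and recent pandas refuses to merge an int64 key with an object key):

```python
import pandas as pd

EMAP = pd.DataFrame({'A': ['Apple', 'Orange', 'Strawberry'],
                     'B': [1, 2, 3],
                     'C': ['abc', 'abc', 'abc']})
RCA = pd.DataFrame({'A': ['Apple', 'Orange'], 'B': [1, 2]})
col_select_Event = ['A', 'B']

def merge_row(row):
    # row is a Series; transpose it back into a one-row DataFrame,
    # then restore the numeric dtype of 'B' lost in the transpose
    a = row[col_select_Event].to_frame().T.infer_objects()
    return pd.merge(RCA, a, on=col_select_Event, how='inner')

j = EMAP.apply(merge_row, axis=1)   # a Series holding one merged DataFrame per row
```

Rows whose (A, B) pair is absent from RCA simply produce an empty merge result.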
Say I have a dataframe df with column "age".
Say "age" has some NaN values and I want to create two new dataframes, dfMean and dfMedian, which fill in the NaN values differently.
This is the way I would do it:
# Step 1:
dfMean = df
dfMean["age"].fillna(df["age"].mean(),inplace=True)
# Step 2:
dfMedian= df
dfMedian["age"].fillna(df["age"].median(),inplace=True)
I'm curious whether there's a way to do each of these steps in one line instead of two, by returning the modified dataframe without needing to copy the original. But I haven't been able to find anything so far. Thanks, and let me know if I can clarify or if you have a better title in mind for the question :)
Note that with dfMean = dfMean["age"].fillna(df["age"].mean()) you would create a Series, not a DataFrame.
To add two new Series (= columns) to your DataFrame, use:
df2 = df.assign(age_fill_mean=df["age"].fillna(df["age"].mean()),
                age_fill_median=df["age"].fillna(df["age"].median()))
Alternatively, you can use pandas.DataFrame.agg():
"Aggregate using one or more operations over the specified axis."
df.agg({'age' : ['mean', 'median']})
You still need to create two new DataFrames, but each can be done in a single line with DataFrame.fillna, passing a dictionary that maps the column name to its replacement value:
dfMean = df.fillna({'age': df["age"].mean()})
dfMedian = df.fillna({'age': df["age"].median()})
One line is:
dfMean,dfMedian=df.fillna({'age': df["age"].mean()}), df.fillna({'age': df["age"].median()})
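A quick check of this behaviour on made-up data (note the original df keeps its NaN, because fillna returns a new DataFrame here):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [10.0, np.nan, 20.0, 90.0]})

dfMean = df.fillna({'age': df['age'].mean()})      # mean of [10, 20, 90] is 40
dfMedian = df.fillna({'age': df['age'].median()})  # median of [10, 20, 90] is 20

print(dfMean['age'].tolist())     # [10.0, 40.0, 20.0, 90.0]
print(dfMedian['age'].tolist())   # [10.0, 20.0, 20.0, 90.0]
print(df['age'].isna().sum())     # 1 -- the original is untouched
```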
In order to iterate a list through a function, I used the following code:
tot = {}
for i in list:
    tot["tot{0}".format(i)] = stateagg(i)  # previously defined function
The output of this is a Python dictionary. I was wondering if there is a way to output to a DataFrame, or a way to convert this back to a DataFrame.
I have tried
pd.DataFrame.from_dict(tot, orient='index')
which results in the following error:
ValueError: If using all scalar values, you must pass an index
Any help much appreciated.
Edit:
apologies I should've been clearer, the function pulls values out of a dataframe to create the dictionary, the data used isn't in list format. The list is used to pull the values out and aggregate data based on the list.
To create a DataFrame from a dict, you can pass its items to the constructor together with a list of column names (a list, not a set, so that column order is preserved):
df = pd.DataFrame(my_list.items(), columns=["A", "B"])
For example
my_list = {"Math": 25, "Diana": 22, "Jhon": 30}
df = pd.DataFrame(my_list.items(), columns=["Name", "Age"])
df
Name Age
0 Math 25
1 Diana 22
2 Jhon 30
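Two equivalent sketches with a made-up tot (stateagg isn't shown in the question, so scalar placeholders stand in for its results):

```python
import pandas as pd

tot = {'totA': 1.5, 'totB': 2.5}   # placeholder results

# option 1: key/value pairs become rows
df = pd.DataFrame(list(tot.items()), columns=['key', 'value'])

# option 2: keys become the index, which avoids the
# "If using all scalar values, you must pass an index" ValueError
df2 = pd.Series(tot, name='value').to_frame()
```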
I am new to programming and I have written a program that reads and modifies a large excel file using Python Pandas. In the code I have the following line:
df1 = df1.apply(lambda x : pd.to_numeric(x,errors='ignore'))
Which does what I need it to, but it also turns the data below my header into floats. Is there a way to have them turn into an int type instead?
df1 is a dataframe and I am attempting to create a nested dictionary with its contents.
Option 2
Use this for a list of numeric columns in an existing dataframe:
cols = ['col1', 'col2', 'col3']
df1[cols] = df1[cols].apply(pd.to_numeric, errors='ignore', downcast='integer')
The standard astype(int) is sub-optimal since it doesn't downcast by default.
Option 1
As #AntonvBR mentions, ideally you want to read in series as downcasted integers, if at all possible. Then this separate conversion would not be necessary.
For example, the dtype parameter of pd.read_excel takes a dictionary input:
df = pd.read_excel('file.xlsx', dtype={'Col1': np.int8})
This will only work if you know your columns in advance.
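A small sketch of Option 2 on made-up data (column names are assumptions), showing that downcast='integer' lands on the smallest sufficient integer type:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['1', '2'], 'col2': ['30', '40'], 'name': ['x', 'y']})
cols = ['col1', 'col2']

# strings -> int64 -> downcast to the smallest signed integer type that fits
df1[cols] = df1[cols].apply(pd.to_numeric, downcast='integer')
print(df1.dtypes)   # col1 and col2 are now int8, name stays object
```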
When calling a function using groupby + apply, I want to go from a DataFrame to a Series groupby object, apply a function to each group that takes a Series as input and returns a Series as output, and then assign the output from the groupby + apply call as a field in the DataFrame.
The default behavior is to have the output from groupby + apply indexed by the grouping fields, which prevents me from assigning it back to the DataFrame cleanly. I'd prefer to have the function I call with apply take a Series as input and return a Series as output; I think it's a bit cleaner than DataFrame to DataFrame. (This isn't the best way of getting to the result for this example; the real application is pretty different.)
import pandas as pd
df = pd.DataFrame({
    'A': [999, 999, 111, 111],
    'B': [1, 2, 3, 4],
    'C': [1, 3, 1, 3]
})

def less_than_two(series):
    # Intended for series of length 1 in this case
    # But not intended for many-to-one generally
    return series.iloc[0] < 2
output = df.groupby(['A', 'B'])['C'].apply(less_than_two)
I want the index on output to be the same as df, otherwise I can't assign
to df (cleanly):
df['Less_Than_Two'] = output
Something like output.index = df.index seems too ugly, and using the group_keys argument doesn't seem to work:
output = df.groupby(['A', 'B'], group_keys = False)['C'].apply(less_than_two)
df['Less_Than_Two'] = output
transform returns the results with the original index, just as you've asked for, broadcasting the same result across all elements of a group. One caveat: the dtype may be inferred to be something else, so you may have to cast it yourself.
In this case, in order to add another column, I'd use assign
df.assign(
    Less_Than_Two=df.groupby(['A', 'B'])['C'].transform(less_than_two).astype(bool))
A B C Less_Than_Two
0 999 1 1 True
1 999 2 3 False
2 111 3 1 True
3 111 4 3 False
Assuming your groupby is necessary (and the resulting groupby object will have fewer rows than your DataFrame -- this isn't the case with the example data), then assigning the Series to the 'Is.Even' column will result in NaN values (since the index to output will be shorter than the index to df).
Instead, based on the example data, the simplest approach will be to merge output -- as a DataFrame -- with df, like so:
# reset_index restores 'A' and 'B' from the index to columns; note that the
# dict-renaming form of SeriesGroupBy.agg used here was removed in pandas 1.0
output = df.groupby(['A', 'B'])['C'].agg({'C': is_even}).reset_index()
output.columns = ['A', 'B', 'Is_Even']  # rename target column prior to merging
# a left merge supports a many-to-one relationship between combinations of
# 'A' & 'B' and 'Is_Even', and thus properly maps aggregated values back to
# unaggregated rows
df.merge(output, how='left', on=['A', 'B'])
Also, I should note that you're better off using underscores than dots in variable names; unlike in R, for instance, dots act as operators for accessing object properties, and so using them in variable names can block functionality/create confusion.
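A runnable sketch of this merge-back approach on the question's data, with is_even filled in as a plausible stand-in (it isn't defined in the question); the dict-renaming agg call is swapped for the modern as_index=False plus rename equivalent, since the old form no longer exists in current pandas:

```python
import pandas as pd

df = pd.DataFrame({'A': [999, 999, 111, 111],
                   'B': [1, 2, 3, 4],
                   'C': [1, 3, 1, 3]})

def is_even(series):
    # stand-in aggregator: looks at the first element of each group
    return series.iloc[0] % 2 == 0

# aggregate per (A, B) group, keeping A and B as columns for the merge
output = df.groupby(['A', 'B'], as_index=False)['C'].agg(is_even)
output = output.rename(columns={'C': 'Is_Even'})

# the left merge maps each aggregated value back onto the unaggregated rows
merged = df.merge(output, how='left', on=['A', 'B'])
```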