I have a very junior question in Python - I have a dataframe with a column containing some IDs, and a separate dataframe that contains 2 columns, one of which holds arrays:
df1 = pd.DataFrame({"some_id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame([["A", [1, 2]], ["B", [3, 4]], ["C", [5]]], columns=['letter', 'some_ids'])
I want to add to df1 a new column "letter" that, for a given "some_id", will look up df2, check if this id is in df2['some_ids'], and return df2['letter'].
I tried this:
df1['letter'] = df2[df1['some_id'].isin(df2['some_ids'])].letter
and get NaNs. Any suggestions as to where I made a mistake?
Create a dictionary by flattening the nested lists in a dict comprehension and then use Series.map:
d = {x: a for a,b in zip(df2['letter'], df2['some_ids']) for x in b}
df1['letter'] = df1['some_id'].map(d)
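For reference, the flattened lookup dictionary built by the comprehension looks like this (using df1 and df2 from the question):
print(d)
# {1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'C'}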
Or map via a Series created by DataFrame.explode with DataFrame.set_index:
df1['letter'] = df1['some_id'].map(df2.explode('some_ids').set_index('some_ids')['letter'])
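The intermediate mapper produced by explode and set_index is a plain Series keyed by the individual ids:
print(df2.explode('some_ids').set_index('some_ids')['letter'])
# some_ids
# 1    A
# 2    A
# 3    B
# 4    B
# 5    C
# Name: letter, dtype: object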
Or use a left join after renaming the column:
df1 = df1.merge(df2.explode('some_ids').rename(columns={'some_ids':'some_id'}), how='left')
print(df1)
some_id letter
0 1 A
1 2 A
2 3 B
3 4 B
4 5 C
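As for why the original attempt returns NaNs: isin() compares each scalar in df1['some_id'] against the whole list objects stored in df2['some_ids'], so nothing ever matches and the boolean mask is all False. A quick check (using df1 and df2 from the question):
print(df1['some_id'].isin(df2['some_ids']))
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# Name: some_id, dtype: bool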
I am trying to rename columns in multiple dataframes and convert those columns to an integer. This is the code I have:
def clean_col(df, col_name):
    df.reset_index(inplace=True)
    df.rename(columns={df.columns[0]: 'Date', df.columns[1]: col_name}, inplace=True)
    df[col_name] = df[col_name].apply(lambda x: int(x))
I have a dictionary of the dataframes and the new names for their columns:
d = {
    all_df: "all",
    coal_df: "coal",
    liquids_df: "liquids",
    coke_df: "coke",
    natural_gas_df: "natural_gas",
    nuclear_df: "nuclear",
    hydro_electricity_df: "hydro",
    wind_df: "wind",
    utility_solar_df: "utility_solar",
    geothermal_df: "geo_thermal",
    wood_biomass_df: "biomass_wood",
    biomass_other_df: "biomass_other",
    other_df: "other",
    solar_all_df: "all_solar",
}
for i, (key, value) in enumerate(d.items()):
    clean_col(key, value)
And this is the error I am getting:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
Any help would be appreciated.
You are on the right track by using a dictionary to link your old and new column names. If you loop through your list of dataframes and then loop through your new-column-name dictionary, that will work.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = pd.DataFrame({"A": [1, 2, 3], "D": [4, 5, 6], "F": [4, 5, 6]})
all_dfs = [df1, df2]
display(df1)
display(df2)
d = {
    "A": "aaaaa",
    "D": "ddddd",
}
for df in all_dfs:
    for col in d:
        if col in df.columns:
            df.rename(columns={col: d.get(col)}, inplace=True)
display(df1)
display(df2)
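Alternatively, since the TypeError comes from using DataFrames, which are mutable and therefore unhashable, as dictionary keys, here is a minimal sketch that simply flips the mapping so the hashable string is the key (reusing clean_col and the dataframes from the question):
d = {
    "all": all_df,
    "coal": coal_df,
    # ... and so on for the remaining dataframes
}
for col_name, df in d.items():
    clean_col(df, col_name)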
Using globals (or locals).
import pandas as pd
import io
data1 = '''id,name
1,A
2,B
3,C
4,D
'''
data2 = '''id,name
1,W
2,X
3,Y
4,Z
'''
df1 = pd.read_csv(io.StringIO(data1))
df2 = pd.read_csv(io.StringIO(data2))
def clean_function(dfname, col_name):
    df = globals()[dfname]  # also see locals()
    df.rename(columns={df.columns[0]: 'NewID', df.columns[1]: col_name}, inplace=True)
    return df

mydict = {'df1': 'NewName', 'df2': 'AnotherName'}
for k, v in mydict.items():
    df = clean_function(k, v)
    print(df)
Output:
NewID NewName
0 1 A
1 2 B
2 3 C
3 4 D
NewID AnotherName
0 1 W
1 2 X
2 3 Y
3 4 Z
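If you would rather not reach into globals(), a variant of the same sketch keeps the frames registered in an ordinary dict and looks them up by name (the frames dict here is hypothetical):
frames = {'df1': df1, 'df2': df2}
def clean_function(dfname, col_name):
    df = frames[dfname]
    df.rename(columns={df.columns[0]: 'NewID', df.columns[1]: col_name}, inplace=True)
    return df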
I just created two different lists and then iterated through a list of the dataframes and a list of the new column names:
def clean_col(df, col_name):
    df.reset_index(inplace=True)
    df.rename(columns={df.columns[0]: 'Date', df.columns[1]: col_name}, inplace=True)
    df[col_name] = df[col_name].apply(lambda x: int(x))

list_df = [all_df, coal_df, liquids_df, coke_df, natural_gas_df, nuclear_df,
           hydro_electricity_df, wind_df, utility_solar_df, geothermal_df,
           wood_biomass_df, biomass_other_df, other_df, solar_all_df]
list_col = ['total', 'coal', 'liquids', 'coke', 'natural_gas', 'nuclear',
            'hydro', 'wind', 'utility_solar', 'geo_thermal', 'biomass_wood',
            'biomass_other', 'other', 'all_solar']
for a, b in zip(list_df, list_col):
    clean_col(a, b)
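Note that zip pairs the two lists purely by position and silently stops at the shorter list if the lengths differ; on Python 3.10+ you can opt into an explicit length check:
for a, b in zip(list_df, list_col, strict=True):
    clean_col(a, b)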
The pandas explode method creates a new row for each value found in the inner list of a given column; it is thus a row-wise explode.
Is there an easy column-wise explode already implemented in pandas, i.e. something to transform df into the second dataframe shown below?
MWE:
>>> s = pd.DataFrame([[1, 2], [3, 4]]).agg(list, axis=1)
>>> df = pd.DataFrame({"a": ["a", "b"], "s": s})
>>> df
Out:
a s
0 a [1, 2]
1 b [3, 4]
>>> pd.DataFrame(s.tolist()).assign(a=["a", "b"]).reindex(["a", 0, 1], axis=1)
Out[121]:
a 0 1
0 a 1 2
1 b 3 4
You can use apply to convert those values to Pandas Series, which will ultimately transform the dataframe into the required format:
>>> df.apply(pd.Series)
Out[28]:
0 1
0 1 2
1 3 4
As a side note, your df becomes a Pandas Series after using agg.
For the updated data, you can concat the above result to the existing data frame:
>>> pd.concat([df, df['s'].apply(pd.Series)], axis=1)
Out[48]:
a s 0 1
0 a [1, 2] 1 2
1 b [3, 4] 3 4
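A variant sketch that avoids apply(pd.Series), which constructs one Series per row and can be slow on large frames: build the wide block from the lists in one call and join it back (using the df from the question):
wide = pd.DataFrame(df['s'].tolist(), index=df.index)
print(df.drop(columns='s').join(wide))
#    a  0  1
# 0  a  1  2
# 1  b  3  4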
I have two pandas dataframes that are loaded from CSV files. Each has two columns: column A is an ID with the same values, in the same order, in both CSVs; column B is a numerical value.
I need to create a new CSV with column A identical to the first two and with column B being the average of the two initial CSVs.
I am creating the two dataframes like this:
df1=pd.read_csv(path).set_index('A')
df2=pd.read_csv(otherPath).set_index('A')
If I do
newDf = (df1['B'] + df2['B'])/2
newDf.to_csv(...)
then newDf has the IDs in the wrong order in column A.
If I do
df1['B'] = (df1['B'] + df2['B'])/2
df1.to_csv(...)
I get an error on the first line saying "ValueError: cannot reindex from a duplicate axis".
It seems like this should be trivial, what am I doing wrong?
Try using merge instead of setting an index.
I.e., say we have these dataframes:
df1 = pd.DataFrame({"A" : [1, 2, 3, 4, 5], "B": [3, 4, 5, 6, 7]})
df2 = pd.DataFrame({"A" : [1, 2, 3, 4, 5], "B": [7, 4, 3, 10, 23]})
Then we merge them and create a new column with the mean of both B columns.
together = df1.merge(df2, on='A')
together.loc[:, "mean"] = (together['B_x']+ together['B_y']) / 2
together = together[['A', 'mean']]
And together is:
A mean
0 1 5.0
1 2 4.0
2 3 4.0
3 4 8.0
4 5 15.0
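To finish the original task you can rename the averaged column back to B and write the CSV (the file name here is hypothetical). And if the earlier set_index approach really raised "cannot reindex from a duplicate axis", it is worth checking column A of your real data for duplicate IDs:
together.rename(columns={'mean': 'B'}).to_csv('averaged.csv', index=False)
print(df1['A'].duplicated().any())  # True would explain the reindex error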
I am trying to insert two rows into an existing data frame, but can't seem to get it to work. The existing df is:
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
I want to add two blank rows, one after the block 1 rows and one after the block 2 rows. I would like the new data frame to look like this:
df_new = pd.DataFrame({"a" : [1,2,0,3,4,0,5,6], "block" : [1, 1, 0, 2, 2, 0, 3, 3]})
There don't need to be any values in the rows; I'm planning on using them as placeholders for something else. I've looked into adding rows, but most posts suggest appending one row to the beginning or end of a data frame, which won't work in my case.
Any suggestions as to my dilemma?
import pandas as pd
# Adds a new row to a DataFrame
# oldDf - The DataFrame to which the row will be added
# index - The index where the row will be added
# rowData - The new data to be added to the row
# returns - A new DataFrame with the row added
def AddRow(oldDf, index, rowData):
    # Stitch together the rows before the insertion point, the new row,
    # and the remaining rows. (pd.concat stands in for DataFrame.append,
    # which was removed in pandas 2.0.)
    newDf = pd.concat([oldDf.head(index), pd.DataFrame(rowData), oldDf.tail(-index)])
    # Clean up the row indexes so there aren't any doubles.
    # Figured you may want this.
    newDf = newDf.reset_index(drop=True)
    return newDf
# Initial data
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
# Insert rows
blankRow = {"a": [0], "block": [0]}
df2 = AddRow(df1, 2, blankRow)
df2 = AddRow(df2, 5, blankRow)
For the sake of performance, you can remove the reset_index() call inside the AddRow() function and simply call it once after you've added all your rows.
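For reference, the frame after both insertions matches the df_new the question asked for:
print(df2)
#    a  block
# 0  1      1
# 1  2      1
# 2  0      0
# 3  3      2
# 4  4      2
# 5  0      0
# 6  5      3
# 7  6      3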
If you always want to insert the new row of zeros after each group of values in the block column, you can do the following:
Start with your data frame:
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
Group it using the values in the block column:
gr = df1.groupby('block')
Add a row of zeros to the end of each group:
df_new = gr.apply(lambda x: pd.concat([x, pd.DataFrame({'a': [0], 'block': [0]})], ignore_index=True))
Reset the indexes of the new dataframe:
df_new.reset_index(drop = True, inplace=True)
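The result then looks like this (note that this variant also appends a zero row after the last block, slightly more than the question asked for):
print(df_new)
#    a  block
# 0  1      1
# 1  2      1
# 2  0      0
# 3  3      2
# 4  4      2
# 5  0      0
# 6  5      3
# 7  6      3
# 8  0      0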
You can simply groupby the data based on the block column, concat the placeholder to the bottom of each group, and then append each group to a new dataframe.
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
df1 # original data
Out[67]:
a block
0 1 1
1 2 1
2 3 2
3 4 2
4 5 3
5 6 3
df_group = df1.groupby('block')
df = pd.DataFrame({"a" : [], "block" : []}) # final data to be appended
for name,group in df_group:
group = pd.concat([group,pd.DataFrame({"a" : [0], "block" : [0]})])
df = df.append(group, ignore_index=True)
df
Out[71]:
a block
0 1 1
1 2 1
2 0 0
3 3 2
4 4 2
5 0 0
6 5 3
7 6 3
8 0 0