I am encountering pretty strange behavior. If I let
data = {'newcol': [1, 5], 'othercol': [12, -10]}
df = pandas.DataFrame(data=data)
print(df['newcol'])
I get back a pandas Series object with 1 and 5 in it. Great.
print(df)
I get back the DataFrame as I would expect. Cool.
But what if I want to add to a DataFrame a little at a time? (My use case is saving metrics from machine-learning training runs happening in parallel, where each process gets a row number and then adds to only that row of the DataFrame.)
I can do the following:
df = pandas.DataFrame()
df['newcol'] = pandas.Series()
df['othercol'] = pandas.Series()
df['newcol'].loc[0] = 1
df['newcol'].loc[1] = 5
df['othercol'].loc[0] = 12
df['othercol'].loc[1] = -10
print(df['newcol'])
I get back the pandas Series I would expect, identical to creating the DataFrame by the first method.
print(df)
I see printed that df is an Empty DataFrame with columns [newcol, othercol].
Clearly in the second method the DataFrame's contents are equivalent to those from the first method. So why is it not smart enough to know it is filled? Is there a function I can call to update the DataFrame's knowledge of its own Series so all these (possibly out-of-order) Series can be unified into a consistent DataFrame?
You can assign data to an empty DataFrame with df.loc, which enlarges the DataFrame itself; assigning through df['newcol'].loc[0] only enlarges the extracted Series object, so the DataFrame never picks up the new rows:
df = pd.DataFrame()
df['newcol'] = pd.Series()
df['othercol'] = pd.Series()
df.loc[0, 'newcol'] = 1
df.loc[1, 'newcol'] = 5
df.loc[0, 'othercol'] = 12
df.loc[1, 'othercol'] = -10
   newcol  othercol
0     1.0      12.0
1     5.0     -10.0
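If each Series is built up separately (e.g. one per process, as in the use case above) and only combined at the end, pd.concat along the columns axis is one way to unify them into a consistent DataFrame; a minimal sketch:
import pandas as pd
# each Series can be filled independently and out of order
newcol = pd.Series({1: 5, 0: 1})
othercol = pd.Series({0: 12, 1: -10})
# align on the index labels and combine into one DataFrame
combined = pd.concat({'newcol': newcol, 'othercol': othercol}, axis=1).sort_index()
print(combined)
#    newcol  othercol
# 0       1        12
# 1       5       -10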
Wondering what the best way to tackle this issue is. If I have a DataFrame df1 with the following columns
type_of_fruit  name_of_fruit  price
.....          .....          .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate checking df1 against the expected_cols list? I was trying something like
df_cols = df1.columns.values.tolist()
if df_cols != expected_cols:
and then trying to drop any columns not in expected_cols into another df, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where there are either more columns than expected or fewer. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then how to handle the case where there are more columns than expected.
You can use set difference with -:
Assuming df1 has the columns above:
df1_cols = df1.columns  # ['type_of_fruit', 'name_of_fruit', 'price']
expected_cols = ['name_of_fruit', 'price']
unwanted_cols = list(set(df1_cols) - set(expected_cols))
df2 = df1[unwanted_cols]
df1.drop(unwanted_cols, axis=1, inplace=True)
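Put together as a self-contained run (the fruit values below are made up), that looks like:
import pandas as pd
df1 = pd.DataFrame([['citrus', 'orange', 1.5]],
                   columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit', 'price']
unwanted_cols = list(set(df1.columns) - set(expected_cols))
df2 = df1[unwanted_cols]               # the "dropped" columns, saved for later
df1 = df1.drop(columns=unwanted_cols)  # df1 now holds only the expected columns
print(df1)
#   name_of_fruit  price
# 0        orange    1.5
print(df2)
#   type_of_fruit
# 0        citrus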
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and store the results in a dict: the True key gets the DataFrame with the subset of columns in the list, and the False key gets the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure the columns are always there then do
#d[True] = d[True].reindex(columns=expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
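groupby(..., axis=1) is deprecated in recent pandas versions, so a plain boolean mask over the columns gives the same split; a small sketch using the sample data above:
mask = df.columns.isin(expected_cols)
d = {True: df.loc[:, mask], False: df.loc[:, ~mask]}
d[True]
#   name_of_fruit  price
# 0              2      3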
I have multiple dataframes with multiple columns, like this:
DF =
A B C metadata_Colunm
r1 6 3 9 r1
r2 2 1 1 r2
r3 5 7 2 r3
How can I use a for-loop to iterate over each column to make a new dataframe and then remove rows where values are below 5 for each new dataframe?
The result should look like this:
DF_A=
A metadata_Colunm
6 r1
5 r3
DF_B=
B metadata_Colunm
7 r3
DF_C=
C metadata_Colunm
9 r1
What I have done so far is to make a list of the columns I will use (all excluding metadata) and then go through the columns as new dataframes. Since I also need to preserve the metadata, I add the metadata column as part of each new dataframe:
DF = DF.drop("metadata_Colunm")
ColList = list(DF)
for item in ColList:
    locals()[f"DF_{str(item)}"] = DF[[item, "metadata_Colunm"]]
    locals()[f"DF_{str(item)}"] = locals()[f"DF_{str(item)}"].drop(
        locals()[f"DF_{str(item)}"][locals()[f"DF_{str(item)}"].item > 0.5].index,
        inplace=True)
But using this I get "AttributeError: 'DataFrame' object has no attribute 'item'".
Any suggestions for making this work, or any other solutions, would be greatly appreciated!
Thanks in advance!
dfs = {}
for col in df.columns[:-1]:
    df_new = df[[col, 'metadata_Colunm']]
    dfs[col] = df_new[df_new[col] >= 5]
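Assuming DF from the question is passed in as df, dfs['A'] then holds the per-column frame with the low rows removed:
print(dfs['A'])
#     A metadata_Colunm
# r1  6              r1
# r3  5              r3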
I would make a dictionary to add your new dataframes to, like this:
dictionary = {}
for col in df.columns[:-1]:  # all columns but last
    new_df = df.loc[:, [col, 'metadata_Colunm']]  # make slices
    for index, row in new_df.iterrows():
        if new_df.loc[index, col] < 5:  # remove rows < 5
            new_df.drop(index=index, inplace=True)
    dictionary[col] = new_df  # add to dictionary so you can refer to it later
You can then call each dataframe via e.g. dictionary['A'].
According to this, it's best practice to slice the dataframe using df.loc[] as opposed to df[].
You can apply a filter function to the dataframe(s) instead of looping over rows yourself:
def filter(df, threshold=5):
    for column in df.columns[:-1]:  # skip the metadata column
        df = df[df[column] >= threshold]  # keep rows where every value column meets the threshold
    return df
Then apply the filter to all your dataframes:
dfs = [df1, df2, df3...]
dfs = [filter(df) for df in dfs]
I have a data frame like this:
ID Min_Value Max_Value
1 0 0.10562
2 0.10563 0.50641
3 0.50642 1.0
I have another data frame that contains Value as a column. I want to create a new column in the second data frame which returns the ID when Value is between Min_Value and Max_Value for a given ID, as in the above data frame. I can use if-else conditions, but the number of IDs is large and the code becomes too bulky. Is there an efficient way to do this?
If I understand correctly, just join/merge them into one DataFrame; using the between function you can then select the rows of the second DataFrame whose Value falls within the range.
import pandas as pd
data = {"Min_Value": [0, 0.10563, 0.50642],
        "Max_Value": [0.10562, 0.50641, 1.0]}
df = pd.DataFrame(data, index=[1, 2, 3])
df2 = pd.DataFrame({"Value": [0, 0.1, 0.58]}, index=[1, 2, 3])
df = df.join(df2)
mask_between_values = df['Value'].between(df['Min_Value'], df['Max_Value'], inclusive="both")
# This is the result
df2[mask_between_values]
   Value
1   0.00
3   0.58
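If the ID ranges never overlap, an IntervalIndex lookup can assign the IDs without any if-else chains and scales to a large number of ranges; a minimal sketch with the question's ranges and made-up Values (the frame and column names here are assumptions):
import pandas as pd
ranges = pd.DataFrame({"ID": [1, 2, 3],
                       "Min_Value": [0, 0.10563, 0.50642],
                       "Max_Value": [0.10562, 0.50641, 1.0]})
values = pd.DataFrame({"Value": [0, 0.1, 0.58]})
# build one interval per ID, closed on both ends
intervals = pd.IntervalIndex.from_arrays(ranges["Min_Value"], ranges["Max_Value"], closed="both")
# get_indexer returns the position of the interval containing each Value (-1 if none)
pos = intervals.get_indexer(values["Value"])
values["ID"] = ranges["ID"].to_numpy()[pos]
values.loc[pos == -1, "ID"] = pd.NA  # values that fall outside every range get no ID
print(values)
#    Value  ID
# 0   0.00   1
# 1   0.10   1
# 2   0.58   3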
Suppose you have two dataframes df and new_df, and you want to assign a new column 'new_column' in new_df that holds the ID whenever the 'Value' column lies between 'Min_Value' and 'Max_Value' from df. Then this code may help you (note that it compares the frames row by row, so it assumes they are aligned):
for i in range(0, len(df)):
    if df.loc[i, 'Max_Value'] > new_df.loc[i, 'Value'] and df.loc[i, 'Min_Value'] < new_df.loc[i, 'Value']:
        new_df.loc[i, 'new_column'] = df.loc[i, 'ID']
Windows 10, Python 3.6
I have a dataframe df
df = pd.DataFrame({'name': ['boo', 'foo', 'too', 'boo', 'roo', 'too'],
                   'zip': ['30004', '02895', '02895', '30750', '02895', '02895']})
I want to find the repeated records that have the same 'name' and 'zip', and record the number of repeats. The ideal output is
name repeat zip
0 too 1 02895
Because my dataframe has many more than six rows, I need an iterative method. I appreciate any tips.
I believe you need to group by all columns and use GroupBy.size:
#create DataFrame from an online source
#df = pd.read_csv('someonline.csv')
#df = pd.read_html('someurl')[0]
#or append data to a list in a loop
#L = []
#for x in iterator:
#    L.append(x)
#and create the DataFrame from the constructor
#df = pd.DataFrame(L)
df = df.groupby(df.columns.tolist()).size().reset_index(name='repeat')
#if need specify columns
#df = df.groupby(['name','zip']).size().reset_index(name='repeat')
print (df)
  name    zip  repeat
0  boo  30004       1
1  boo  30750       1
2  foo  02895       1
3  roo  02895       1
4  too  02895       2
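If you only want the combinations that actually repeat (as in the expected output), one option is to filter the counts afterwards; a small sketch starting again from the original df:
out = df.groupby(['name', 'zip']).size().reset_index(name='repeat')
out = out[out['repeat'] > 1]  # keep only name/zip pairs that occur more than once
print(out)
#   name    zip  repeat
# 4  too  02895       2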
Pandas has a handy .duplicated() method that can help you identify duplicates.
df.duplicated()
By passing the duplicate mask into a selection you can get the duplicated records:
df[df.duplicated()]
You can get the number of duplicated records by using .sum():
df.duplicated().sum()
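If you want to see every copy of the duplicated rows rather than only the later occurrences, duplicated accepts keep=False; a quick sketch with the df from the question:
df[df.duplicated(keep=False)]
#   name    zip
# 2  too  02895
# 5  too  02895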
I have created a DataFrame df from 2 lists using the following command:
import pandas as pd
df=pd.DataFrame({'Name' : list1,'Probability' : list2})
But I'd like to remove the first column (the index column) and make the column called Name the first column. I tried using del df['index'] and index_col=0, but they didn't work. I also checked reset_index(), and that is not what I need. I would like to completely remove the index column from a DataFrame that has been created like this (as mentioned above). Someone please help!
You can use set_index (see the docs):
import pandas as pd
list1 = [1,2]
list2 = [2,5]
df=pd.DataFrame({'Name' : list1,'Probability' : list2})
print (df)
Name Probability
0 1 2
1 2 5
df.set_index('Name', inplace=True)
print (df)
Probability
Name
1 2
2 5
If you also need to remove the index name:
df.set_index('Name', inplace=True)
#pandas 0.18.0 and higher
df = df.rename_axis(None)
#pandas below 0.18.0
#df.index.name = None
print (df)
Probability
1 2
2 5
If you want to save your dataframe to a spreadsheet for a report, it is possible to write it without the index column using to_excel with the xlsxwriter engine.
writer = pd.ExcelWriter("Probability" + ".xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='Probability', startrow=3, startcol=0, index=False)
writer.save()
index=False will then save your dataframe without the index column.
I use this all the time when building reports from my dataframes.
I think the best way is to hide the index using the Styler's hide_index method:
df = df.style.hide_index()
this will hide the index when the dataframe is displayed. Note that df.style returns a Styler object rather than a DataFrame, so this only changes how the frame is rendered.
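In pandas 1.4 and later, hide_index() is deprecated in favour of the more general hide(); a quick sketch of the replacement call, assuming the same df:
styled = df.style.hide(axis="index")  # same rendering effect as hide_index() on pandas 1.4+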