I'm trying to remove a list of columns from a pandas.DataFrame. I've created a list through
diff = df1.difference(df2)
I would like to do something like:
for name in diff:
    dataframe.pop(name)
Is there a way to apply the pop function in a vectorized way with all names in the diff index array?
Thanks all for the help!
Regards,
Jan
As MaxU said, the cleanest way to do it is
dataframe.drop(diff, axis=1)
The .pop() method returns the Series you are popping, which is wasteful if you only want to delete the column.
You could also use the del statement, which maps to the magic method df.__delitem__('column'), but I would not recommend that.
You can read this great SO answer to learn more about these options.
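For example, a minimal sketch (assuming you are diffing the column sets of df1 and df2; note that drop returns a new frame unless you pass inplace=True):
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'b', 'c'])
df2 = pd.DataFrame(columns=['a'])
diff = df1.columns.difference(df2.columns)  # Index(['b', 'c'])
df1 = df1.drop(diff, axis=1)                # drops all columns in diff at once, no loop needed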
I'm trying to obtain the "DESIRED OUTCOME" shown in my image below. I have a somewhat messy way to do it that I came up with, but I was hoping there is a more efficient way to do this using pandas. Please advise, and thank you in advance!
The problem is pretty standard, and so is its solution: group by the first column and join the data in the second column. Note that the function join is not called but passed to apply as a parameter.
df.groupby('Name')['Food'].apply(';'.join)
#Name
#Gary Oranges;Pizza
#John Tacos
#Matt Chicken;Steak
You can group by the Name column, then aggregate with the ';'.join function:
df.groupby('Name').agg({'Food': ';'.join})
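A quick worked example of both approaches with made-up data (apply returns a Series indexed by Name, agg returns a one-column DataFrame):
import pandas as pd

df = pd.DataFrame({'Name': ['Gary', 'Gary', 'John', 'Matt', 'Matt'],
                   'Food': ['Oranges', 'Pizza', 'Tacos', 'Chicken', 'Steak']})
print(df.groupby('Name')['Food'].apply(';'.join))  # Series: Gary -> Oranges;Pizza, ...
print(df.groupby('Name').agg({'Food': ';'.join}))  # DataFrame with a Food column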
If I have a "reference" to a dataframe, there appears to be no way to append to it in pandas because neither append nor concat support the inplace=True parameter.
An (overly) simple example:
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df = chosen_df.append(chosen_row)
Now, because Python assignment copies references by value, chosen_df will initially be a reference to whichever candidate dataframe passed some_test.
But the update semantics of pandas mean that the referenced dataframe is not updated by append; a new object is returned and the name is rebound to it instead. I believe that if it were possible to use inplace=True this would work, but it looks like that isn't likely to happen, given the discussion here: https://github.com/pandas-dev/pandas/issues/14796
It's worth noting that a simpler example using lists rather than dataframes does work, because the contents of lists are directly mutated by append(), as the sketch below illustrates.
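A minimal illustration of the difference (using the DataFrame.append API from the era of this question; it has since been deprecated in favour of pd.concat):
import pandas as pd

a = [1]
b = a
b.append(2)  # mutates the shared list; a is now [1, 2]

df = pd.DataFrame({'x': [1]})
ref = df
ref = ref.append({'x': 2}, ignore_index=True)  # rebinds ref; df is unchanged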
So my question is --- How could an updatable abstraction over N dataframes be achieved in Python?
The idiom is commonplace, useful and trivial in languages that allow references, so I'm guessing I'm missing a Pythonic trick, or thinking about the whole problem with the wrong hat on!
Obviously the pure illustrative example can be resolved by duplicating the append in the body of an if...else and concretely referencing each underlying dataframe in turn. But this isn't scalable to more complex examples and it's a generic solution akin to references I'm looking for.
Any ideas?
There is a simple way to do this specifically for pandas dataframes - so I'll answer my own question.
chosen_df, chosen_row = (candidate_a_df, candidate_a_row) if some_test else (candidate_b_df, candidate_b_row)
chosen_df.loc[max_idx+1] = chosen_row
The calculation of max_idx very much depends on the structure of chosen_df. In the simplest case, when it is a dataframe with a sequential index starting at 0, you can simply use the length of the index to calculate it.
If chosen_df has a non-sequential index, you'll need to call max() on the index rather than rely on its length.
If chosen_df is a slice or a groupby result, you'll need to calculate the max from the parent dataframe to ensure it's truly the max across all rows.
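A minimal sketch of the first two cases, assuming a numeric index (names follow the example above):
import pandas as pd

chosen_df = pd.DataFrame({'a': [1.0, 2.0]})  # sequential index starting at 0
max_idx = len(chosen_df.index) - 1           # max label is length - 1

# non-sequential numeric index: take the max label directly
max_idx = chosen_df.index.max()

chosen_df.loc[max_idx + 1] = [3.0]           # mutates chosen_df in place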
I am attempting to sort by the maximum value of a column such as:
dfWinners = df_new_out['Margin'].apply[max().sort_values(ascending=False)]
But am unsure of how to use the apply function for multiple methods such as max and sort.
I'm sure it's a very simple problem but I cannot for the life of me find it through searching on here.
Thank you!
It's not clear what you're trying to do by assigning to dfWinners.
If you intend dfWinners to be a list of values sorted from maximum to minimum, as you've described above, then you can use Python's built-in sorted() function.
dfWinners = sorted(df_new_out['Margin'], reverse=True)
Otherwise, you can sort your dataframe in place:
df_new_out.sort_values(by = ['Margin'], ascending=False, inplace=True)
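A quick usage sketch with made-up data:
import pandas as pd

df_new_out = pd.DataFrame({'Margin': [3, 10, 7]})
df_new_out.sort_values(by=['Margin'], ascending=False, inplace=True)
print(df_new_out['Margin'].tolist())  # [10, 7, 3]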
OK, I have this part of my code:
def Reading_Old_File(self, Path, turn_index, SKU):
    print "Reading Old File! Turn Index = ", turn_index, "SKU= ", SKU
    lenght_of_array = 0
    array_with_data = []
    if turn_index == 1:
        reading_old_file = open(Path, 'rU')
        data = np.genfromtxt(reading_old_file, delimiter="''", dtype=None)
        for index, line_in_data in enumerate(data, start=0):
            if index < 3:
                print index, "Not Yet"
            if index >= 3:
                print ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Reading All Old Items"
                i = index - 3
                old_items_data[i] = line_in_data.split("\t")
                old_items_data[i] = [lines_old.strip() for lines_old in old_items_data]
                print old_items_data[i]
    print len(old_items_data)
So what I am doing here is reading a file. On the first turn, I want to read it all and keep all the data, so it would be something like:
old_items_data[1]=['123','dog','123','dog','123','dog']
old_items_data[2]=['124','cat','124','cat','124','cat']
old_items_data[n]=['amount of list members is equal each time']
Each line of the file should be stored in a list so I can use it later for comparing: when turn_index is 2 or greater, I'll compare each incoming line with the lines in every list (array) by iterating over all the lists.
So the question is: how do I do this, or is there a better way to compare lists?
I'm new to Python, so maybe someone could help me with this issue?
Thanks
You just need to use append.
old_items_data.append(line_in_data.split("\t"))
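Based on the loop in your code, a hedged sketch of how it would look (keeping your variable names):
old_items_data = []  # initialise once, before the loop
for index, line_in_data in enumerate(data):
    if index >= 3:
        old_items_data.append([field.strip() for field in line_in_data.split("\t")])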
I would use the package pandas for this. It will not only be much quicker, but also simpler. Use pandas.read_table to import the data (specifying delimiter and row-skipping can be done here by passing arguments to sep and skiprows). Then, use pandas.DataFrame.apply to apply your function to the rows of your data.
The speed gains are going to come from the fact that pandas was optimized to perform actions across lists like this (in the case of a pandas DataFrame, these would be called rows). This applies to both importing the data and applying a function to every row. The simplicity gains should hopefully be clear.
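A minimal sketch, assuming a tab-delimited file whose first 3 rows should be skipped (matching the index < 3 check in your code); the file name and comparison target are hypothetical:
import pandas as pd

df = pd.read_table('old_file.txt', sep='\t', skiprows=3, header=None)

# apply a function to every row (axis=1), e.g. compare against a known line
target = ['123', 'dog', '123', 'dog', '123', 'dog']
matches = df.apply(lambda row: [str(v).strip() for v in row] == target, axis=1)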
I am trying to pass a generator to the DataFrame constructor, e.g. testdf = pd.DataFrame(test), but I am unable to specify that each line is tab-delimited. The result is that I end up with a single-column dataframe where each row is the entire line of values separated by '\t'.
I've tried a couple of other ways:
pd.read_csv(test)
pandas.io.parsers.read_table(test, sep='\t')
but neither of these works, because they do not accept a generator as input.
Not too familiar with generators. Can you throw them into a list comprehension? If so, how about
pd.DataFrame([x.split('\t') for x in test])
One solution that I found would be to use a split function on the one column to break it up:
testdf_parsed = pd.DataFrame(testdf.row.str.split('\t').tolist())
...and that did work for me, but maybe a more elegant and simple solution exists that leverages the core capabilities of pandas?
You might try implementing a file-like object that wraps your generator, then feeding that to read_table.
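A hedged sketch of the idea; for simplicity this drains the generator into an in-memory buffer rather than streaming (a fully streaming wrapper would implement read() itself), and the sample generator is a stand-in for your test:
import io
import pandas as pd

test = (line for line in ['a\tb\tc', '1\t2\t3'])  # stand-in generator of tab-delimited lines
buf = io.StringIO('\n'.join(test))                # file-like object read_table can parse
testdf = pd.read_table(buf, sep='\t')             # columns a, b, c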