I want to exclude the index column from the displayed view of a DataFrame:
I sort the whole dataframe based on the values (in descending order) and assign ranks.
It works perfectly, however the index column is a bit misleading (especially in the ranking).
I already tried to replace the index column by using the column Rank as the index via
df.set_index('Rank', inplace=True)
However, the sorting is then lost, and I may get a KeyError if 2 persons (like here) have the same Rank.
My code is:
from scipy.stats import rankdata
import pandas as pd
from tabulate import tabulate
names = ['Tim', 'Tom', 'Sam', 'Kyle']
values = [2, 4, 5, 4]
df = pd.DataFrame({'Name': names,'Values': values})
columns = ["Name", "Values"]
df['Rank'] = df['Values'].rank(method='dense', ascending=False).astype(int)
df.sort_values(by="Rank", ascending=True)
Most (possibly all?) of the pandas to_... methods take an index argument. If you set it to False, the index won't be shown. If you really want the pretty HTML output in Jupyter, then do
from IPython.display import HTML
HTML(df.sort_values(by="Rank", ascending=True).to_html(index=False))
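As a concrete sketch using the question's own data, the same index=False pattern also works for plain-text output via to_string (and to_csv takes the same argument):

```python
import pandas as pd

names = ['Tim', 'Tom', 'Sam', 'Kyle']
values = [2, 4, 5, 4]
df = pd.DataFrame({'Name': names, 'Values': values})
df['Rank'] = df['Values'].rank(method='dense', ascending=False).astype(int)
ranked = df.sort_values(by='Rank', ascending=True)

# Plain-text view with the index column suppressed
text = ranked.to_string(index=False)
print(text)
```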
I'm trying to group the values of the list below in a dataframe based on Style, Gender and Region, but with the values filled down.
My current attempt gets a dataframe without Style and Region filled down. I am not sure whether this is a good approach or whether it would be better to manipulate the list lst directly.
import pandas as pd
lst = [
['Tee','Boy','East','12','11.04'],
['Golf','Boy','East','12','13'],
['Fancy','Boy','East','12','11.96'],
['Tee','Girl','East','10','11.27'],
['Golf','Girl','East','10','12.12'],
['Fancy','Girl','East','10','13.74'],
['Tee','Boy','West','11','11.44'],
['Golf','Boy','West','11','12.63'],
['Fancy','Boy','West','11','12.06'],
['Tee','Girl','West','15','13.42'],
['Golf','Girl','West','15','11.48']
]
df1 = pd.DataFrame(lst, columns = ['Style','Gender','Region','Units','Price'])
df2 = df1.groupby(['Style','Region','Gender']).count()
Current output (content of df2):
Output I'm looking for:
You just need to use reset_index, which will move the group keys out of the index and back into regular columns:
df2.reset_index(inplace=True)
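A minimal sketch with a trimmed version of the question's data, showing that reset_index moves the MultiIndex group keys back into ordinary columns:

```python
import pandas as pd

lst = [
    ['Tee', 'Boy', 'East', '12', '11.04'],
    ['Golf', 'Boy', 'East', '12', '13'],
    ['Tee', 'Girl', 'West', '15', '13.42'],
]
df1 = pd.DataFrame(lst, columns=['Style', 'Gender', 'Region', 'Units', 'Price'])

# count() leaves Style/Region/Gender in a MultiIndex...
df2 = df1.groupby(['Style', 'Region', 'Gender']).count()

# ...and reset_index turns them back into regular columns
df2.reset_index(inplace=True)
print(df2.columns.tolist())
```

For this particular count, passing as_index=False to groupby would avoid the MultiIndex in the first place.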
I have a Pandas dataframe with just 2 columns: the first is a name, and the second is a dictionary of information relevant to that name. Adding new rows works fine, but if I try to update the dictionary column by assigning a new dictionary in place, I get
ValueError: Incompatible indexer with Series
So, to be exact, this is what I was doing to produce the error:
import pandas as pd
df = pd.DataFrame(data=[['a', {'b':1}]], columns=['name', 'attributes'])
pos = df[df.loc[:,'name']=='a'].index[0]
df.loc[pos, 'attributes'] = {'c':2}
I was able to find another solution that seems to work:
import pandas as pd
df = pd.DataFrame(data=[['a', {'b':1}]], columns=['name', 'attributes'])
pos = df[df.loc[:,'name']=='a'].index[0]
df.loc[:,'attributes'].at[pos] = {'c':2}
but I was hoping to get an answer as to why the first method doesn't work, or if there was something wrong with how I had it initially.
Since you are accessing the dataframe by integer position pos, you can use a positional accessor. Note that chained indexing such as df.iloc[pos]['attributes'] = ... may silently assign to a copy, so set the single cell directly:
df.iat[pos, df.columns.get_loc('attributes')] = {'c': 2}
For me, DataFrame.at works:
df.at[pos, 'attributes'] = {'c':2}
print (df)
name attributes
0 a {'c': 2}
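A self-contained repro of the working fix (my reading of why the original failed, stated as an assumption: loc's setter treats a dict value as Series-like and tries to align it with the indexer, while at sets exactly one cell verbatim):

```python
import pandas as pd

df = pd.DataFrame(data=[['a', {'b': 1}]], columns=['name', 'attributes'])
pos = df[df['name'] == 'a'].index[0]

# .at is a strict single-cell setter, so the dict is stored as-is
df.at[pos, 'attributes'] = {'c': 2}
print(df.at[pos, 'attributes'])
```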
I am looking to select all values that include "hennessy" in the name, e.g. "Hennessy Black Cognac" and "Hennessy XO". I know it would simply be
trial = Sales[Sales["Description"] == "Hennessy"]
if I wanted only the exact value "Hennessy", but I want any row that contains the word "Hennessy" at all.
I am working in Python with pandas imported.
Thanks :)
You can check whether the substring is present by lowercasing the column and using str.contains.
Like this:
trial = Sales[Sales["Description"].str.lower().str.contains("hennessy", na=False)]
You can try using str.startswith:
import pandas as pd
# initialize list of lists
data = [['Hennessy Black Cognac', 10], ['Hennessy XO', 15], ['julian merger', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
new_df = df.loc[df.Name.str.startswith('Hennessy', na=False)]
new_df
Or you can use apply to easily apply any string-matching function to your column elementwise:
df_new =df[df['Name'].apply(lambda x: x.startswith('Hennessy'))]
df_new
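Since the goal is to match the word anywhere in the string (not only at the start), str.contains is the closer fit; a sketch with the same toy data plus one extra, made-up row where "Hennessy" is not the prefix:

```python
import pandas as pd

data = [['Hennessy Black Cognac', 10], ['Hennessy XO', 15],
        ['julian merger', 14], ['Old Hennessy Blend', 12]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Case-insensitive substring match; na=False keeps missing values out
mask = df['Name'].str.contains('hennessy', case=False, na=False)
print(df[mask]['Name'].tolist())
```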
The following code finds any strings in column B. Is it possible to loop over multiple columns of a dataframe, outputting the cells containing strings for each column?
import pandas as pd
for i in df:
    print(df[df[i].str.contains(r'^[a-zA-Z]+$')])
Link to the code above:
https://stackoverflow.com/a/65410078/12801962
Here is how to loop through the columns:
import pandas as pd
colList = ['ColB', 'Some_other', 'ColC']
for col in colList:
    subdf = df[df[col].str.contains(r'^[a-zA-Z]+$')]
    # do something with the sub-DataFrame
Or do it in one long test and get all the problem rows in one dataframe:
import pandas as pd
subdf = df[((df['ColB'].str.contains(r'^[a-zA-Z]+$')) |
(df['Some_other'].str.contains(r'^[a-zA-Z]+$')) |
(df['ColC'].str.contains(r'^[a-zA-Z]+$')))]
Not sure if this is what you are intending to do:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['ColA'] = ['ABC', 'DEF', 12345, 23456]
df['ColB'] = ['abc', 12345, 'def', 23456]
all_trues = pd.Series(np.ones(df.shape[0], dtype=bool))
for col in df:
    # na=False treats non-string cells as non-matches instead of NaN
    all_trues &= df[col].str.contains(r'^[a-zA-Z]+$', na=False)
df[all_trues]
Which will give the result:
ColA ColB
0 ABC abc
Try:
for k, s in df.astype(str).items():
    print(s.loc[s.str.contains(r'^[a-zA-Z]+$')])
Or, for the values only (no index nor column information):
for k, s in df.astype(str).items():
    print(s.loc[s.str.contains(r'^[a-zA-Z]+$')].values)
Note, both of the above only work because you just want to print the matching values in the columns, not return a new structure with filtered entries.
If you tried to make a new DataFrame with the cells filtered by the condition, that would lead to ragged arrays, which are not implemented (you could replace those cells with a marker of your choice, but you cannot cut them away). Another possibility would be to select rows where any or all of the cells meet the condition you are testing for (that way, the result is a homogeneous array, not a ragged one).
Yet another option would be to return a list of Series, each representing a column, or a dict of colname: Series:
{k: s.loc[s.str.contains(r'^[a-zA-Z]+$')] for k, s in df.astype(str).items()}
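The "any or all of the cells" row selection mentioned above can be sketched like this, reusing the toy frame from earlier:

```python
import pandas as pd

df = pd.DataFrame({'ColA': ['ABC', 'DEF', 12345, 23456],
                   'ColB': ['abc', 12345, 'def', 23456]})

# One boolean DataFrame: True where the cell is purely alphabetic
letters = df.astype(str).apply(lambda s: s.str.contains(r'^[a-zA-Z]+$'))

any_rows = df[letters.any(axis=1)]  # rows with at least one all-letter cell
all_rows = df[letters.all(axis=1)]  # rows where every cell is all letters
print(any_rows.index.tolist(), all_rows.index.tolist())
```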
I have a dataframe that looks like the following.
import pandas as pd
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
'Lists':["4,67,3,4,53,32", "7,3,44,2,5,6,9", "8,9,23", "9,36,21,32"]}
# Create DataFrame
df = pd.DataFrame(data)
I want to keep the rows where the list in 'Lists' contains any value from the pre-defined list [1, 2, 3, 4, 5].
What would be the most efficient and rapid way of doing this?
I'd like to avoid a for loop, and I am relying on your proficiency with pandas dataframes to tell me the best way to achieve this.
In the example above, this would keep only the rows for 'Tom' and 'nick'.
Many thanks!
This would work:
values = set(str(i) for i in [1, 2, 3, 4, 5]) # note the set
idx = df['Lists'].str.split(',').map(lambda x: len(values.intersection(x)) > 0)
df.loc[idx, 'Name']
0 Tom
1 nick
Name: Name, dtype: object
First convert the values to a set for faster membership tests (if you have many values), then filter rows where 'Lists' intersects the values.
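An equivalent loop-free alternative (not the answer above, just another vectorized route) is to explode the split lists into one row per element, test membership with isin, and reduce back to one boolean per original row:

```python
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Lists': ["4,67,3,4,53,32", "7,3,44,2,5,6,9", "8,9,23", "9,36,21,32"]}
df = pd.DataFrame(data)
values = {'1', '2', '3', '4', '5'}

# One row per list element, membership test, then any() per original index
hits = (df['Lists'].str.split(',')
        .explode()
        .isin(values)
        .groupby(level=0)
        .any())
print(df.loc[hits, 'Name'].tolist())
```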