Convert data frame with a single column to a difference matrix - python

I have a data frame with a single column of values and an index of sample names:
>>> df = pd.DataFrame(data={'value':[1,3,4]},index=['cat','dog','bird'])
>>> print(df)
      value
cat       1
dog       3
bird      4
I would like to convert this to a square matrix wherein each cell of the matrix shows the difference between every set of two values:
      cat  dog  bird
cat     0    2     3
dog     2    0     1
bird    3    1     0
Is this possible? If so, how do I go about doing this?
I have tried to use scipy.spatial.distance.squareform to convert my starting data frame into a matrix, but apparently what I am starting with is not the right type of vector. Any help would be much appreciated!
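One possible approach, offered as a minimal sketch rather than a definitive answer, is to skip squareform and use NumPy broadcasting: subtract the column from itself along two axes and take the absolute value. The names values, diff and diff_df are illustrative and not part of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'value': [1, 3, 4]}, index=['cat', 'dog', 'bird'])

# 1-D array of the single column
values = df['value'].to_numpy()

# Broadcasting a column vector against a row vector gives every pairwise difference
diff = np.abs(values[:, None] - values[None, :])

# Reuse the sample names for both rows and columns
diff_df = pd.DataFrame(diff, index=df.index, columns=df.index)
print(diff_df)
```

This builds the full n x n matrix in memory, so it suits a modest number of samples.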

Related

How to change the value of a column items using pandas?

This is my first question on Stack Overflow.
I'm implementing a machine learning classification algorithm and I want to generalize it to any input dataset that has its target class in the last column. To do that, I want to modify all values of this column with pandas, without needing to know the column names or row labels.
For example, let's suppose I load a dataset:
dataset = pd.read_csv('random_dataset.csv')
Let's say the last column has the following data:
0 dog
1 dog
2 cat
3 dog
4 cat
I want to change each "dog" to 1 and each "cat" to 0, so that the column would look like this:
0 1
1 1
2 0
3 1
4 0
I have found some ways of changing the values of specific cells using pandas, but for this case, what would be the best way to do that?
I appreciate each answer.
You can use pandas.Categorical:
df['column'] = pd.Categorical(df['column']).codes
You can also use the built-in category dtype and its cat accessor:
df['column'] = df['column'].astype('category').cat.codes
Use map to assign the values explicitly:
df['col_name'] = df['col_name'].map({'dog' : 1 , 'cat': 0})
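Or use factorize (which encodes the object as an enumerated type) if any consistent integer codes will do. The codes are assigned in order of first appearance, so here dog would become 0 and cat 1:
df['col_name'] = df['col_name'].factorize()[0]
Output (for the map version):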
0 1
1 1
2 0
3 1
4 0
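The three approaches above do not produce the same codes, which the following sketch tries to make visible. It selects the last column by position, as the question asks. The toy dataset and the names feature, target and last_col are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('random_dataset.csv')
dataset = pd.DataFrame({
    'feature': [0.1, 0.2, 0.3, 0.4, 0.5],
    'target': ['dog', 'dog', 'cat', 'dog', 'cat'],
})

# Name of the last column, whatever it is called
last_col = dataset.columns[-1]

# Category codes follow the sorted categories: cat -> 0, dog -> 1
codes_sorted = dataset[last_col].astype('category').cat.codes

# factorize follows order of first appearance: dog -> 0, cat -> 1
codes_seen = pd.factorize(dataset[last_col])[0]

# map gives full control over which label gets which number
mapped = dataset[last_col].map({'dog': 1, 'cat': 0})

# Write the chosen encoding back into the last column
dataset[last_col] = mapped
```

If the exact numbers matter (dog must be 1), map is the explicit choice; the other two only guarantee a consistent encoding.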

Pandas save counts of multiple columns in single dataframe

I have a dataframe with 3 columns that looks like this:
Model         IsJapanese  IsGerman
BenzC                  0         1
BensGla                0         1
HondaAccord            1         0
HondaOdyssey           1         0
ToyotaCamry            1         0
I want to create a new dataframe and have TotalJapanese and TotalGerman as two columns in the same dataframe.
I am able to achieve this by creating 2 different dataframes, but I am wondering how to get both counts in a single dataframe.
Please suggest an approach. Thank you!
Edit: adding a second, similar dataset to this question (sorry, I'm not sure whether that is allowed).
For the second dataset, I am trying to save multiple counts in a single dataframe, based on repeated values in the data.
Here is my sample dataset:
Store       Address    IsLA  IsGA
Albertsons  Cross St      1     0
Safeway     LeoSt         0     1
Albertsons  Main St       0     1
RiteAid     Culver St     1     0
My aim is to prepare a new dataset with multiple counts per store.
The result should look like this:
Store       TotalStores  TotalLA  TotalGA
Albertsons            2        1        1
Safeway               1        0        1
RiteAid               1        1        0
Is it possible to achieve this in a single dataframe?
Thanks!
One way would be to store the sum of Japanese cars and German cars, and manually create a dataframe using them:
j, g = df['IsJapanese'].sum(), df['IsGerman'].sum()
total_df = pd.DataFrame({'TotalJapanese': j,
                         'TotalGerman': g}, index=['Totals'])
print(total_df)
        TotalJapanese  TotalGerman
Totals              3            2
Another way would be to transpose (T) your dataframe, sum(axis=1), and transpose back:
total_df_v2 = pd.DataFrame(df.set_index('Model').T.sum(axis=1)).T
print(total_df_v2)
   IsJapanese  IsGerman
0           3         2
To answer your second question, you can use DataFrameGroupBy.agg after grouping on your 'Store' column, applying 'count' to Address and 'sum' to your other two columns. Then you can rename() your columns if needed:
resulting_df = (
    df.groupby('Store')
      .agg({'Address': 'count', 'IsLA': 'sum', 'IsGA': 'sum'})
      .rename({'Address': 'TotalStores', 'IsLA': 'TotalLA', 'IsGA': 'TotalGA'}, axis=1)
)
Prints:
            TotalStores  TotalLA  TotalGA
Store
Albertsons            2        1        1
RiteAid               1        1        0
Safeway               1        0        1
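As a variation on the groupby answer, pandas named aggregation can set the output column names inside agg itself, so no separate rename is needed. The sketch below rebuilds the sample store data so it runs on its own; the variable names stores and store_totals are illustrative only.

```python
import pandas as pd

# Sample data from the second question
stores = pd.DataFrame({
    'Store': ['Albertsons', 'Safeway', 'Albertsons', 'RiteAid'],
    'Address': ['Cross St', 'LeoSt', 'Main St', 'Culver St'],
    'IsLA': [1, 0, 0, 1],
    'IsGA': [0, 1, 1, 0],
})

# Named aggregation: output_name=(input_column, aggregation)
store_totals = stores.groupby('Store').agg(
    TotalStores=('Address', 'count'),
    TotalLA=('IsLA', 'sum'),
    TotalGA=('IsGA', 'sum'),
)
print(store_totals)
```

Calling store_totals.reset_index() afterwards would turn Store back into an ordinary column, matching the layout requested in the question.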

How to parse the output of Pandas' stack() function

stack() is an excellent Pandas function. It returns a stacked DataFrame or Series. How can I parse this output and print it well formatted (for example with the to_markdown() function)?
>>> df_single_level_cols
     weight  height
cat       0       1
dog       2       3
>>> df_single_level_cols.stack()
cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64
You mean that with to_markdown the two index levels are displayed as tuples, right? Then try:
print(df_single_level_cols.stack().reset_index().to_markdown())
You may have to adjust the index and the column names.
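A minimal sketch of the index and column name adjustment: name the two index levels before resetting them, and name the value column at the same time. The labels animal, measure and value are made up for illustration. to_markdown itself is left as a comment because it needs the optional tabulate package.

```python
import pandas as pd

df_single_level_cols = pd.DataFrame(
    [[0, 1], [2, 3]],
    index=['cat', 'dog'],
    columns=['weight', 'height'],
)

stacked = df_single_level_cols.stack()

# Name both index levels, then move them into ordinary columns
tidy = stacked.rename_axis(['animal', 'measure']).reset_index(name='value')

# Plain-text rendering; with tabulate installed, tidy.to_markdown(index=False) also works
print(tidy.to_string(index=False))
```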

How to add a new column and fill it up with a specific value depending on another column's series?

I'm new to Pandas, but thanks to the question "Add column with constant value to pandas dataframe" I was able to add several columns at once with
c = {'new1': 'w', 'new2': 'y', 'new3': 'z'}
df.assign(**c)
However I'm trying to figure out what's the path to take when I want to add a new column to a dataframe (currently 1.2 million rows * 23 columns).
Let's simplify the df a bit and try to make it more clear:
Order  Orderline  Product
    1          0  Laptop
    1          1  Bag
    1          2  Mouse
    2          0  Keyboard
    3          0  Laptop
    3          1  Mouse
I would like to add a new column that is 1 for every row of an Order containing at least one Product == Bag, and 0 otherwise.
Result would become:
Order  Orderline  Product   HasBag
    1          0  Laptop         1
    1          1  Bag            1
    1          2  Mouse          1
    2          0  Keyboard       0
    3          0  Laptop         0
    3          1  Mouse          0
What I could do is find all the unique order numbers, then filter out the subframe, check the Product column for Bag, if found then add 1 to a new column, otherwise 0, and then replace the original subframe with the result.
There is likely a much better and more performant way to accomplish this.
The main reason I'm trying to do this, is to flatten things down later on. Every order should become 1 line with some values of product. I don't need the information for Bag anymore but I want to keep in my dataframe if the original order used to have a Bag (1) or no Bag (0).
Ultimately when the data is cleaned out it can be used as a base for scikit-learn (or that's what I hope).
If I understand you correctly, you want GroupBy.transform with 'any'.
First we create a boolean array by checking which rows in Product are Bag with Series.eq. Then we GroupBy on this boolean array and check if any of the values are True. We use transform to keep the shape of our initial array so we can assign the values back.
df['ind'] = df['Product'].eq('Bag').groupby(df['Order']).transform('any').astype(int)
   Order  Orderline   Product  ind
0      1          0    Laptop    1
1      1          1       Bag    1
2      1          2     Mouse    1
3      2          0  Keyboard    0
4      3          0    Laptop    0
5      3          1     Mouse    0
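For reference, here is a self-contained version of the transform approach using the HasBag column name from the question, next to a different technique based on isin. The isin variant is not from the answer above; it is an alternative sketch, and the names orders_with_bag and has_bag_alt are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Order': [1, 1, 1, 2, 3, 3],
    'Orderline': [0, 1, 2, 0, 0, 1],
    'Product': ['Laptop', 'Bag', 'Mouse', 'Keyboard', 'Laptop', 'Mouse'],
})

# Approach from the answer: per-row boolean, grouped by order, broadcast back with transform
is_bag = df['Product'].eq('Bag')
df['HasBag'] = is_bag.groupby(df['Order']).transform('any').astype(int)

# Alternative sketch: collect the orders that contain a Bag, then test membership
orders_with_bag = df.loc[is_bag, 'Order']
has_bag_alt = df['Order'].isin(orders_with_bag).astype(int)

print(df)
```

Both give the same flags on this sample. Which one is faster on 1.2 million rows would need to be measured on the real data.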

Update dataframe values that match a regex condition and keep remaining values intact

The following is an excerpt from my dataframe:
In[1]: df
Out[1]:
  LongName  BigDog
1  Big Dog       1
2  Mastiff       0
3  Big Dog       1
4      Cat       0
I want to use regex to update BigDog values to 1 if LongName is a mastiff. I need other values to stay the same. I tried this, and although it assigns 1 to mastiffs, it nulls all other values instead of keeping them intact.
def BigDog(longname):
    if re.search('(?i)mastiff', longname):
        return '1'
df['BigDog'] = df['LongName'].apply(BigDog)
I'm not sure what to do, could anybody please help?
You don't need a loop or apply. Use str.match with DataFrame.loc:
df.loc[df['LongName'].str.match('(?i)mastiff'), 'BigDog'] = 1
  LongName  BigDog
1  Big Dog       1
2  Mastiff       1
3  Big Dog       1
4      Cat       0
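One detail worth checking against your data: str.match only matches at the start of each string, whereas re.search in the question looks anywhere in it. The sketch below uses str.contains for the search-anywhere behaviour. The 'English Mastiff' row is a hypothetical addition, not from the original dataframe, included only to show the difference.

```python
import pandas as pd

df = pd.DataFrame(
    {'LongName': ['Big Dog', 'Mastiff', 'English Mastiff', 'Cat'],  # third row is hypothetical
     'BigDog': [1, 0, 0, 0]},
    index=[1, 2, 3, 4],
)

# Anchored at the start of the string: misses 'English Mastiff'
anchored = df['LongName'].str.match('(?i)mastiff')

# Searches anywhere in the string, case-insensitively, like re.search with (?i)
is_mastiff = df['LongName'].str.contains('mastiff', case=False, regex=True)

# Only the matching rows are overwritten; all other BigDog values stay intact
df.loc[is_mastiff, 'BigDog'] = 1
print(df)
```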
