How to make matrix taking specific columns from Dataframe pandas? - python

I have a data set, https://github.com/mayuripandey/Data-Analysis/blob/main/similarity.csv. Is there any way I can build a matrix from two specific columns, for example Count and Topic?

Simply subset the columns of interest, and retrieve the values without the column names using the ".values" attribute.
df = pd.read_html("https://github.com/mayuripandey/Data-Analysis/blob/main/similarity.csv")[0]
df[["Count","Topic"]].values
This returns a 2D NumPy array containing only the values. If you need a matrix object, you can convert it like this:
np.matrix(df[["Count","Topic"]].values)
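As a minimal sketch of the same idea without downloading anything, the snippet below builds a small stand-in dataframe (the rows and the extra "Name" column are made up for illustration; only the "Count" and "Topic" column names come from the question) and uses .to_numpy(), which does the same job as .values here. The NumPy documentation no longer recommends np.matrix for new code, so a plain 2D array is usually enough.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for similarity.csv; the values are invented.
df = pd.DataFrame({
    "Topic": [0, 1, 2],
    "Count": [120, 45, 30],
    "Name": ["a", "b", "c"],
})

# Select the two columns of interest, then drop the labels
# by converting to a 2D NumPy array (one row per dataframe row).
arr = df[["Count", "Topic"]].to_numpy()

print(arr.shape)  # (3, 2)
```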

Related

I have a dataframe containing arrays; is there a way to collect all of the elements and store them in a separate dataframe?

I can't seem to find a way to split all of the array values from the column of a dataframe.
I have managed to get all the array values using this code:
The dataframe is as follows:
I want to use value_counts() on the dataframe and I get this:
I want the array values that are clubbed together to be split so that I can get the accurate count of every value.
Thanks in advance!
You could try .explode(), which would create a new row for every value in each list.
df_mentioned_id_exploded = pd.DataFrame(df_mentioned_id.explode('entities.user_mentions'))
With the above code you would create a new dataframe df_mentioned_id_exploded with a single column entities.user_mentions, which you could then use .value_counts() on.
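A small self-contained sketch of that approach is shown here. The column name is taken from the answer, but the user names in the lists are invented sample data, since the original dataframe is not shown.

```python
import pandas as pd

# Invented sample data: each cell holds a list of mentioned users.
df_mentioned_id = pd.DataFrame({
    "entities.user_mentions": [["alice", "bob"], ["alice"], ["carol", "alice"]],
})

# explode() emits one row per list element, repeating the original index.
exploded = df_mentioned_id.explode("entities.user_mentions")

# Now every value is counted individually instead of per list.
counts = exploded["entities.user_mentions"].value_counts()
print(counts)
```

Note that the exploded rows keep the index of the row they came from, so call .reset_index(drop=True) if you need a unique index afterwards.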

How to make a sum by using group by?

I have the following dataset and I want to sum the values of the column UnitPrice grouping by CustomerID.
I'm trying it the following way, but although the new column is added, its values are not filled in:
data['TotalEN'] = round(data.groupby(['SalesOrderID'])['UnitPrice'].sum(),2)
I printed the groupby result to check whether the values are calculated correctly, and indeed they are:
print(data.groupby(['CustomerID'])['UnitPrice'].sum())
What am I doing wrong?
In this case, the shape of the output from the groupby operation will be different than the shape of your dataframe. You will need to use the transform method on the groupby object to restore the correct shape you need:
data['TotalEN'] = data.groupby(['SalesOrderID'])['UnitPrice'].transform('sum').round(2)
You can read more about transform in the pandas documentation.
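To make the shape difference concrete, here is a rough sketch on invented data (only the column names come from the question). sum() returns one value per group, indexed by SalesOrderID, so assigning it to a column aligns on the wrong index; transform('sum') returns one value per original row.

```python
import pandas as pd

# Invented sample data using the column names from the question.
data = pd.DataFrame({
    "SalesOrderID": [1, 1, 2, 2, 2],
    "UnitPrice": [10.0, 5.5, 3.0, 4.0, 1.25],
})

grouped = data.groupby("SalesOrderID")["UnitPrice"]

per_group = grouped.sum()           # one entry per SalesOrderID
per_row = grouped.transform("sum")  # one entry per row of data

# per_row shares data's index, so the assignment lines up row by row.
data["TotalEN"] = per_row.round(2)

print(len(per_group), len(per_row))  # 2 5
print(data)
```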

putting matrix in one pandas DataFrame cell

I'd like to take a list of 1000 np.ndarrays (each element in the list is an array whose shape is 3X3X8) and use this list as a pandas DataFrame column, so that each cell in the column is a matrix.
How can it be accomplished?
You may want to look at xarray.
I've found this really useful for abstracting "square" data where all of the arrays in your list have the same shape.
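If you would rather stay in pandas, one possible alternative (not the xarray approach suggested above) is an object-dtype column in which every cell holds one array. The sketch below fills the arrays with placeholder values and uses a made-up column name, "matrix". Such a column cannot be used in vectorised arithmetic, which is one reason a labelled N-dimensional container may suit this data better.

```python
import numpy as np
import pandas as pd

# Placeholder data: 1000 arrays of shape (3, 3, 8), array i filled with i.
arrays = [np.full((3, 3, 8), i) for i in range(1000)]

# Build a 1D object array explicitly so that each element stays a whole
# ndarray instead of being unpacked into extra dimensions.
cells = np.empty(len(arrays), dtype=object)
for i, a in enumerate(arrays):
    cells[i] = a

df = pd.DataFrame({"matrix": cells})

print(df.shape)                     # (1000, 1)
print(df["matrix"].iloc[0].shape)   # (3, 3, 8)
```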

Fill missing values (na) with a list/series after modelling missing values

I am trying to plug the predicted missing values into the original df (into the column with the missing values, of course). How could I do so?
The predicted missing values are stored in a list/series whose length is the number of missing values in the original df. The order in the list matches the order in which the missing values appear in the df, I think, since I split the test_set from the df using notnull() on the series with missing values.
I have been trying pd.Series.fillna, but that only allows a single replacement value.
You can use numpy where and pandas isnull function to do that.
df['relevant_column'] = np.where(df['relevant_column'].isnull(),
                                 predicted_values,
                                 df['relevant_column'])
predicted_values should be a pandas series or 1d numpy array with the same length as the dataframe.
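If predicted_values instead has one entry per missing value, as described in the question, a boolean mask with .loc may be a simpler fit. This is a sketch on invented numbers, and it assumes the predictions are in the same order as the missing rows.

```python
import numpy as np
import pandas as pd

# Invented example: two missing entries and two predictions, in row order.
df = pd.DataFrame({"relevant_column": [1.0, np.nan, 3.0, np.nan, 5.0]})
predicted_values = [2.0, 4.0]

# True where a value is missing.
mask = df["relevant_column"].isnull()

# Assign the predictions to the missing positions only;
# the list length must equal the number of True entries in mask.
df.loc[mask, "relevant_column"] = predicted_values

print(df["relevant_column"].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```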

indexing into a column in pandas

I am trying to set colb in a pandas dataframe depending on the value in cola.
The order in which I refer to the two column indices in the array seems to have an impact on whether the indexing works. Why is this?
Here is an example of what I mean.
I set up my dataframe:
test=pd.DataFrame(np.random.rand(20,1))
test['cola']=[x for x in range(20)]
test['colb']=0
If I try to set column b using the following code:
test.loc['colb',test.cola>2]=1
I get the error: ValueError: setting an array element with a sequence
If I use the following code, the code alters the dataframe as I expect.
test.loc[test.cola>2,'colb']=1
Why is this?
Further, is there a better way to assign a column using a test like this?
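For reference, .loc takes the row indexer first and the column indexer second, so the boolean mask over rows belongs in the first position. That is why only the second form behaves as expected. The sketch below repeats the working form and shows np.where as one possible alternative; the extra column name "colc" is hypothetical and used only for comparison.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame(np.random.rand(20, 1))
test["cola"] = range(20)
test["colb"] = 0

# .loc[rows, columns]: the boolean mask selects rows, 'colb' selects the column.
test.loc[test["cola"] > 2, "colb"] = 1

# Alternative: compute the whole column in one step from the condition.
test["colc"] = np.where(test["cola"] > 2, 1, 0)

print(test[["cola", "colb", "colc"]].head())
```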
