How to create binary representations of words in a pandas column?

I have a column which contains lists of variable sizes. The lists contain a limited number of short text values: around 60 unique values altogether.
0 ["AC","BB"]
1 ["AD","CB", "FF"]
2 ["AA","CC"]
3 ["CA","BB"]
4 ["AA"]
I want to make these values columns in my dataframe, where the value of each column is 1 if the value appears in that row's list and 0 if not.
I know I could expand the lists, then call unique and set those as new columns, but after that I don't know what to do.

Here's one way:
df = pd.get_dummies(df.explode('val')).sum(level=0)
NOTE: sum(level=0) is a grouping operation on the index: explode repeats each original index label once per list element, and summing on level 0 folds those repeated labels back into one indicator row per original row. That is why it pairs well with exploding the dataframe.
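As a runnable sketch of the above (the list column name val is assumed, as in the one-liner; note that on pandas 2.x sum(level=0) has been removed, so the equivalent groupby(level=0).sum() spelling is used):

import pandas as pd

df = pd.DataFrame({'val': [["AC", "BB"], ["AD", "CB", "FF"], ["AA", "CC"],
                           ["CA", "BB"], ["AA"]]})

# explode keeps the original index, so duplicate labels mark which row each
# element came from; summing on that level folds them back into one row.
out = pd.get_dummies(df.explode('val')).groupby(level=0).sum()
print(out)

To drop the automatic val_ prefix from the dummy columns, pass prefix='' and prefix_sep='' to get_dummies.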

Related

I need to slice a column of lists into 5 columns in a pandas/Spark dataframe

I have a Spark dataframe with 3 columns:
df = df.select("timestamp","outer_value","values")
[screenshot: current format]
The "values" column contains lists of variable lengths (although always in multiples of 5, so for instance the 1st row might have 30 values, the 2nd row might have 60 values, etc.).
I want to split the values column in such a way that it takes the first 5 values and puts them in 5 new columns, then puts the next 5 values in the second row of the same 5 columns, and so on.
So let's say, for the above example, the 1st row of the "values" column is a list of 15 elements, the second row is a list of 20, and the 3rd row is a list of 10 elements; then my dataframe should look something like this:
[screenshot: desired format]
I tried F.slice with something like this, but it only works for fixed values: assuming 60 as a fixed total, I'd take 12 slices of 5 each. I need it to divide the values column dynamically, stopping when there are no more list elements left.
[screenshot: code I tried]
So instead of having 12 slices of 5 each, I need it to keep stacking 5 values into 5 separate columns, then from the 6th value onwards return to the 1st column and insert values 6-10 into the same 5 columns, and keep doing this until each row's list is finished.
Is there a better way to do it?
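No answer was recorded for this one, but here is a minimal PySpark sketch of one dynamic approach. It assumes "values" is an array column, that timestamp identifies the source row, and the output column names col_0 through col_4 are made up for illustration:

from pyspark.sql import functions as F

# Explode each list with its position, derive a chunk number (pos // 5)
# and a slot within the chunk (pos % 5), then pivot the slots into columns.
exploded = df.select("timestamp", "outer_value",
                     F.posexplode("values").alias("pos", "val"))
result = (exploded
          .withColumn("chunk", F.floor(F.col("pos") / 5))
          .withColumn("slot", F.concat(F.lit("col_"),
                                       (F.col("pos") % 5).cast("string")))
          .groupBy("timestamp", "outer_value", "chunk")
          .pivot("slot", ["col_0", "col_1", "col_2", "col_3", "col_4"])
          .agg(F.first("val"))
          .orderBy("timestamp", "chunk"))

Because the chunk number is computed from each list's own positions, this stops naturally when a row's list runs out; no fixed total length is assumed.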

Split strings from a column into separate columns

I am trying to split string values from a column into as many columns as there are strings in each row.
I am creating a new dataframe with three columns, and I have the string values in the third column. I want to split them into new columns (which already have headers), but the number of semicolon-separated strings is different in each row.
If I use this code:
df['string']= df['string'].str.split(';', expand=True)
then only one value is left in the column, while the rest of the string values are not split out but eliminated.
Can you advise how this line of code should be modified in order to get the right output?
Many thanks in advance.
Instead of overwriting the original column, you can take the result of split and join it with the original DataFrame:
df = pd.DataFrame({'my_string':['car;war;bus','school;college']})
df = df.join(df['my_string'].str.split(';',expand=True))
print(df)
        my_string       0        1     2
0     car;war;bus     car      war   bus
1  school;college  school  college  None
If you instead want to keep only the first value in the original column, drop expand=True and take the first element of each split list (a DataFrame has no .str accessor, so the original line would fail):
df['string'] = df['string'].str.split(';').str[0]
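Since the question mentions that the new columns already have headers, here is a small sketch of renaming the split columns before joining; the header names below are hypothetical:

import pandas as pd

df = pd.DataFrame({'my_string': ['car;war;bus', 'school;college']})

headers = ['first', 'second', 'third']    # hypothetical pre-existing headers
split = df['my_string'].str.split(';', expand=True)
split.columns = headers[:split.shape[1]]  # use only as many headers as needed
df = df.join(split)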

How to delete multiple rows in a pandas DataFrame based on condition?

I know how to delete rows and columns from a dataframe using the .drop() method, by passing axis and labels.
Here's the Dataframe: [screenshot of a DataFrame with an STNAME column]
Now, if I want to remove all rows whose STNAME runs from Arizona all the way to Colorado, how should I do it?
I know I could just do it by passing row labels 2 to 7 to the .drop() method, but if I have a lot of data and I don't know the starting and ending indexes, that won't be possible.
Might be kinda hacky, but here is an option:
import numpy as np

# First index labelled 'Arizona' and last index labelled 'Colorado'
index1 = df.index[df['STNAME'] == 'Arizona'].tolist()[0]
index2 = df.index[df['STNAME'] == 'Colorado'].tolist()[-1]
df = df.drop(np.arange(index1, index2 + 1))
This basically takes the first index number of Arizona and the last index number of Colorado, and deletes every row of the data frame between (and including) these indexes.
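A hedged alternative sketch that avoids np.arange and works as long as the Arizona-to-Colorado block is contiguous, using an inclusive label slice:

start = (df['STNAME'] == 'Arizona').idxmax()       # first 'Arizona' label
stop = df.index[df['STNAME'] == 'Colorado'][-1]    # last 'Colorado' label
df = df.drop(df.loc[start:stop].index)             # .loc slices are inclusive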

Counting frequency of specific dataframe cells

I need to count the number of times a specific number is seen within a specific cell.
[DataFrame screenshot]
The values are between 1 to 7.
In the column Entity_Types, the first cell contains 7,7,6,7,6,7,1,7,7,7,2. I think I need to create 7 additional empty columns, count the frequency of each occurrence (for each number), and append the counts to new columns labeled Entity_Types_1, Entity_Types_2, etc.
Example: new column 7 would hold the count of 7s, while new column 1 would hold the count of all 1s in that respective cell. My table has 30,000 rows, so I was wondering how to run this in a loop to fill out the rest of the dataset.
I can easily do it in Excel using this formula:
=SUMPRODUCT(LEN(O2)-LEN(SUBSTITUTE(O2,"2","")))
where O2 is Entity_Types and "2" is the number we are looking to find.
It looks like Entity_Types is a column of strings in your data frame df. If that is the case, you can use:
# Values run from 1 to 7, so build one count column per digit
for i in range(1, 8):
    df['Entity_Types_{}'.format(i)] = df.Entity_Types.str.count(str(i))
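A quick self-contained check of that loop on the question's first cell, assuming the column really is stored as strings:

import pandas as pd

df = pd.DataFrame({'Entity_Types': ['7,7,6,7,6,7,1,7,7,7,2']})
for i in range(1, 8):
    df['Entity_Types_{}'.format(i)] = df.Entity_Types.str.count(str(i))

# Entity_Types_1 == 1, Entity_Types_2 == 1, Entity_Types_6 == 2, Entity_Types_7 == 7
print(df.iloc[0])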

Accessing data in a Pandas dataframe with one row

I use Pandas dataframes to manipulate data and I usually visualise them as virtual spreadsheets, with rows and columns defining the positions of individual cells. I'm happy with the methods to slice and dice the dataframes, but there seems to be some odd behaviour when the dataframe contains a single row.

Basically, I want to select rows of data from a large parent dataframe that meet certain criteria and then pass those results as a daughter dataframe to a separate function for further processing. Sometimes there will only be a single record in the parent dataframe that meets the defined criteria, so the daughter dataframe will contain only a single row. Nevertheless, I still need to be able to access data in the daughter in the same way as for the parent dataframe. To illustrate my point, consider the following dataframe:
import pandas as pd
tempDF = pd.DataFrame({'group': [1,1,1,1,2,2,2,2],
                       'string': ['a','b','c','d','a','b','c','d']})
print(tempDF)
Which looks like:
group string
0 1 a
1 1 b
2 1 c
3 1 d
4 2 a
5 2 b
6 2 c
7 2 d
As an example, I can now select those rows where 'group' == 2 and 'string' == 'c', which yields just a single row. As expected, the length of the dataframe is 1, and it's possible to print just a single cell using .loc based on index values from the original dataframe:
tempDF2 = tempDF.loc[((tempDF['group']==2) & (tempDF['string']=='c')),['group','string']]
print(tempDF2)
print('Length of tempDF2 = ',tempDF2.index.size)
print(tempDF2.loc[6,['string']])
Output:
group string
6 2 c
Length of tempDF2 = 1
string c
However, if I select a single row by passing a single index label to .loc, then the result is printed in a transposed form and its length is now given as 2 (rather than 1). Clearly, it's no longer possible to select single cell values based on the index of the original parent dataframe:
tempDF3 = tempDF.loc[6,['group','string']]
print(tempDF3)
print('Length of tempDF3 = ',tempDF3.index.size)
Output:
group 2
string c
Name: 6, dtype: object
Length of tempDF3 = 2
In my mind, both of these methods are doing the same thing, namely selecting a single row of data. However, in the second example the rows and columns are transposed, making it impossible to extract data in the expected way.
Why do these two behaviours exist? What is the point of transposing a single row of a dataframe as a default behaviour? How can I make sure that a dataframe containing a single row isn't transposed when I pass it to another function?
tempDF3 = tempDF.loc[6,['group','string']]
The 6 in the first position of the .loc selection dictates that the return type will be a Series, and hence your problem. Instead, use [6]:
tempDF3 = tempDF.loc[[6],['group','string']]
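A short check of the difference, building on tempDF from the question:

row_as_series = tempDF.loc[6, ['group', 'string']]    # scalar label -> Series
row_as_frame = tempDF.loc[[6], ['group', 'string']]   # list label -> DataFrame
print(len(row_as_series), len(row_as_frame))          # 2 1
print(row_as_frame.loc[6, 'string'])                  # 'c', as with tempDF2

The "length of 2" in the question is the length of the returned Series (its two entries, group and string), not a two-row dataframe.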
