Excel 'COUNTIF()' functionality using Python Pandas

How can I implement Excel's 'COUNTIF()' in Python?
I have a column named 'Title' that contains some text values (CD, PDF), and I need to count how often each string occurs in the column, as given below.
No. of CD: 4
No. of PDF: 1
In Excel I can get this with the following formula:
=COUNTIF($A$5:$A$9,"CD")
How can I do the same in Python?

For a simple summary of item counts, try .value_counts() on a Pandas data frame column:
import pandas as pd
my_list = ['CD','CD','CD','PDF','CD']
df = pd.DataFrame({'my_column': my_list}) # create a data frame with one column from the list
df['my_column'].value_counts()
... or on a Pandas series:
pd.Series(my_list).value_counts()
Having a column of counts can be especially useful for scrutinizing issues in larger datasets. Use the list's .count() method to create a column with the corresponding counts (similar to using COUNTIF() to populate a column in Excel):
df['countif'] = [my_list.count(i) for i in my_list] # count list item occurrences and assign to new column
print(df[['my_column','countif']]) # view results

You can use map to compare each value with "CD" and then sum the results.
Example:
Create the "Title" data:
df = pd.DataFrame({"Title":["CD","CD","CD","PDF","CD"]})
Then do the COUNTIF with map and sum:
df["Title"].map(lambda x: int(x=="CD")).sum()
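The answers above can be condensed into a vectorized form. The following is a minimal sketch, assuming the same five-row "Title" column as in the question; the names cd_count and counts are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"Title": ["CD", "CD", "CD", "PDF", "CD"]})

# COUNTIF for a single value: the comparison yields booleans, and True sums as 1
cd_count = int((df["Title"] == "CD").sum())

# COUNTIF for every distinct value at once
counts = df["Title"].value_counts()

# COUNTIF filled down a column: look up each row's value in the counts
df["countif"] = df["Title"].map(counts)

print(cd_count)
print(counts.to_dict())
print(df)
```

Comparing the whole column at once avoids calling a Python lambda per row, which tends to matter only on larger datasets.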

Related

Split list into a row of columns in Pandas

I have an image that I can read in via imageio.imread(). However, for simplicity's sake, let's say I have a list of [1,2,3,4].
I'd like to transform that list into a dataframe of columns, represented as
pixel_1 pixel_2 pixel_3 pixel_4
0 1 2 3 4
(The column names are not of vital importance; I just want to spread the data across columns.)
I've tried various apply and map methods but I'm just not sure how to read an image via imageio.imread() into a column spread like this.
Check whether the lines below are helpful:
import pandas as pd
# Create a blank dataframe
df = pd.DataFrame()
input_list = [1,2,3,4]
header_value = 'pixel'
for item in input_list:
    df[header_value+"_"+str(item)] = [item] # add one single-row column per list item
# Out of loop
print(df)
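The loop above can also be replaced by building the single row in one step. This is a sketch under the assumption that the image arrives as a NumPy array (which is what imageio.imread() returns); the small 2x2 array below is only a stand-in for real image data, and the column names are numbered by position.

```python
import numpy as np
import pandas as pd

# Stand-in for an image array; a real image would come from imageio.imread()
pixels = np.array([[1, 2], [3, 4]])

# Flatten to one row so that each pixel value gets its own column
flat = pixels.reshape(1, -1)

# Name the columns pixel_1, pixel_2, ... by position
columns = [f"pixel_{i}" for i in range(1, flat.shape[1] + 1)]
df = pd.DataFrame(flat, columns=columns)

print(df)
```

For a plain list such as [1,2,3,4], passing [input_list] as the data with the same column names gives the same single-row frame.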

Python 3 | How to print the dataframes by their names once they are created dynamically?

I have a Dataframe df_main with the following columns:
ID Category Time Status XYZ
1 A value value value
2 B value value value
3 C value value value
4 D value value value
5 E value value value
Using the following code, I have created new Dataframes based on Categories in the table. I have created a Dataframe dictionary and created the dataframes in this format df_A, df_B, df_C...
I have stored the row in the new Dataframes equivalent to the Category Name. So, df_A will have the row from df_main which has the Category value "A".
Code:
dict_of_df = {} # initialize empty dictionary
i = 0
for index, row in df_main.iterrows():
    if i < 5:
        newname = df_main['Category'].values[i]
        dict_of_df["df_{}".format(newname)] = row
        i = i + 1
I want to print the dataframes by their dataframe name, and not by iterating the dictionary.
It should be like this:
print(df_A)
print(df_B)
print(df_C)
print(df_D)
print(df_E)
How can I achieve this?
A solution without using a dictionary would work too. Any solution is fine as long as I am able to store a row of a specific Category in a new Dataframe specific to Category Name and print it using the Dataframe name.
Let me know if more details are required.
Edit:
This link is somewhat similar to my use case:
Using String Variable as variable name
I wanted to be specific to dataframes, as my end goal was to print the dataframes by their names.
The method mentioned in the answers of that link is specific to variables and would need a different code solution using the exec method for dataframes.
The idea behind this code is to include it in Power BI. Get Source using python script in Power BI accepts dataframes as tables, for which, I would have to declare or print a dataframe in the code.
Change your dataframe source as needed:
import pandas as pd
df_main = pd.read_excel('main.xlsx') # use data source as per your requirement
dict_of_df = {} # initialize empty dictionary
i = 0
for index, row in df_main.iterrows():
    if i < 5:
        print('df_'+row['Category']) # print the name under which the row is stored
        newname = df_main['Category'].values[i]
        dict_of_df["df_{}".format(newname)] = row
        i = i + 1
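To actually reach each piece by name, one possible approach is sketched below. It assumes a small stand-in for df_main with made-up values, splits it with groupby so each piece stays a DataFrame rather than a row Series, and then copies the dictionary into the module namespace with globals(). Creating variables this way is generally discouraged, but it does give top-level names such as df_A, which is what the Power BI use case in the question appears to need.

```python
import pandas as pd

# Stand-in for df_main; the values are placeholders for illustration
df_main = pd.DataFrame({
    "ID": [1, 2, 3],
    "Category": ["A", "B", "C"],
    "Status": ["open", "closed", "open"],
})

# One DataFrame per category, keyed as df_A, df_B, ...
dict_of_df = {f"df_{cat}": group for cat, group in df_main.groupby("Category")}

# Access by key, without iterating over the dictionary
print(dict_of_df["df_A"])

# Optional: expose every entry as a top-level variable
globals().update(dict_of_df)
print(df_B)
```

Keeping the dictionary and indexing it by key is usually the safer choice; the globals() step is only there for tools that pick up dataframes by variable name.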

How to export a dictionary to excel using Pandas

I am trying to export some data from python to excel using Pandas, and not succeeding. The data is a dictionary, where the keys are a tuple of 4 elements.
I am currently using the following code:
df = pd.DataFrame(data)
df.to_excel("*file location*", index=False)
and I get an exported 2-column table as follows:
I am trying to get an excel table where the first 3 elements of the key are split into their own columns, and the 4th element of the key (Period in this case) becomes a column name, similar to the example below:
I have tried using different additions to the above code but I'm a bit new to this, and so nothing is working so far
Based on what you have shown us (which is not reproducible), you need pandas.MultiIndex:
df_ = df.set_index(0) # `0` since your tuples seem to be located in the first column
df_.index = pd.MultiIndex.from_tuples(df_.index) # convert the simple index of tuples into a multi-level index
# `~.unstack` does the job of moving your periods into columns
df_.unstack(level=-1).droplevel(0, axis=1).to_excel(
    "file location", index=True
)
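Since the question's data was not shown, the following is only a sketch with invented data: a dictionary whose keys are 4-tuples, where the level names Region, Product, Unit and Period and all values are hypothetical. It illustrates the same MultiIndex and unstack idea starting directly from the dictionary.

```python
import pandas as pd

# Hypothetical data: keys are 4-tuples, the last element is the period
data = {
    ("North", "Widget", "kg", "2020"): 10,
    ("North", "Widget", "kg", "2021"): 12,
    ("South", "Gadget", "kg", "2020"): 7,
    ("South", "Gadget", "kg", "2021"): 9,
}

# A Series built from tuple keys gets a MultiIndex automatically
s = pd.Series(data)
s.index.names = ["Region", "Product", "Unit", "Period"]

# Move the last key element into the columns, keep the first three as columns
wide = s.unstack("Period").reset_index()

print(wide)
# wide.to_excel("output.xlsx", index=False)  # hypothetical file name
```

The to_excel call is left commented out because it needs an Excel writer package such as openpyxl to be installed.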
You could try exporting to a CSV instead:
df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv', index = False)
The CSV can then be converted to an Excel file easily.

Select a subset of an object type cell in panda Dataframe

I am trying to extract part of each cell in an object-type column with str.split:
dataset['pictures'].str.split(pat=",")
I want to get the numbers 40092 and 39097 and the two picture dates as two columns, ID and DATE, but as a result I get one column consisting of NaNs.
'pictures' column:
{"col1":"40092","picture_date":"2017-11-06"}
{"col1":"39097","picture_date":"2017-10-31"}
...
Here's what I understood from your question:
You have a pandas Dataframe in which one of the columns contains JSON strings (or any other strings that need to be parsed into multiple columns).
E.g.
import pandas as pd
df = pd.DataFrame({'pictures': [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}']
})
You want to parse the two elements ('col1' and 'picture_date') into two separate columns for further processing (or perhaps just one of them).
Define a function for parsing the row:
import json
def parse_row(r):
    j = json.loads(r['pictures']) # parse the JSON string in this row
    return j['col1'], j['picture_date']
And use the Pandas DataFrame.apply() method as follows:
df1 = df.apply(parse_row, axis=1, result_type='expand')
The result is a new dataframe with two columns - each containing the parsed data:
0 1
0 40092 2017-11-06
1 39097 2017-10-31
If you need just one column you can return a single element from parse_row (instead of a two element tuple in the above example) and just use df.apply(parse_row).
If the values are not in json format, just modify parse_row accordingly (Split, convert string to numbers, etc.)
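As a variation on the approach above, the parsed dictionaries can be turned into named columns directly. This is a sketch assuming every cell holds a valid JSON string with the keys col1 and picture_date; the target names ID and DATE come from the question, and the conversion to datetime is an optional extra.

```python
import json

import pandas as pd

df = pd.DataFrame({"pictures": [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}',
]})

# Parse each JSON string into a dict, then build a frame from the list of dicts
parsed = pd.DataFrame(df["pictures"].map(json.loads).tolist(), index=df.index)

# Rename to the column names asked for in the question
parsed = parsed.rename(columns={"col1": "ID", "picture_date": "DATE"})

# Optional: turn the date strings into real timestamps
parsed["DATE"] = pd.to_datetime(parsed["DATE"])

print(parsed)
```

The ID values stay strings here, exactly as they appear in the JSON; cast them with astype(int) if numbers are needed.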
Thanks for the replies, but I solved it by loading the 'pictures' column from the dataset into a list:
picturelist = dataset['pictures'].values.tolist()
Afterwards I created a dataframe from that list and concatenated it with the original dataset without the 'pictures' column:
two_new_columns = pd.DataFrame(picturelist)
new_dataset = pd.concat([dataset.drop(columns='pictures'), two_new_columns], axis=1)

pandas dataset transformation to normalize the data

I have a csv file like this:
I want to transform it into a pandas dataframe like this:
Basically, I'm trying to normalize the dataset to populate a SQL table.
I have used json_normalize to create a separate dataset from genres column but I'm at a loss over how to transform both the columns as shown in the above depiction.
Some suggestions would be highly appreciated.
If the genre_id is the only numeric value (as shown in the picture), you can use the following:
# Find all occurrences of digits in the column and convert the list items to a comma-separated string
df['genre_id'] = df['genres'].str.findall(r'(\d+)').apply(', '.join)
# Use pandas.DataFrame.explode to generate one row per genre_id by splitting on the commas
df = df.assign(genre_id = df.genre_id.str.split(',')).explode('genre_id')
# Finally, remove the extra space
df['genre_id'] = df['genre_id'].str.lstrip()
# If required, create a new dataframe with these 2 columns only
df = df[['id','genre_id']]
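The join, split and strip steps above can be skipped by exploding the list that findall returns. The sketch below uses invented sample rows, since the original CSV was only shown as an image, and it relies on the same assumption as the answer: the genre ids are the only digits in the genres column.

```python
import pandas as pd

# Invented sample rows; the real file was not included in the question
df = pd.DataFrame({
    "id": [862, 8844],
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        "[{'id': 12, 'name': 'Adventure'}]",
    ],
})

# findall gives a list of digit strings per row; explode makes one row per item
normalized = (
    df.assign(genre_id=df["genres"].str.findall(r"\d+"))
      .explode("genre_id")[["id", "genre_id"]]
      .reset_index(drop=True)
)

# Store the ids as integers, ready for a SQL table
normalized["genre_id"] = normalized["genre_id"].astype(int)

print(normalized)
```

If a genre name could itself contain digits, parsing the column with ast.literal_eval or json_normalize would be the more robust route.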
