I have a dataframe whose first column is filled with timestamps. I want to create a new column and set it to 0 or 1 based on the hour value of each timestamp: for example, if %H >= "03" -> 1, else 0.
The df looks like this:
2018-08-29T00:03:09 12 0
2018-08-23T00:08:10 2 0
And I want to fill the 3rd column with 1s as described. Thank you all in advance for the effort!
Let's say you have a dataframe like the following:
import pandas as pd
from datetime import datetime
d = {'time': ['2018-08-29T00:03:09', '2018-08-29T12:03:09', '2018-08-31T10:05:09'],
     'serial': [1, 2, 3]}
data = pd.DataFrame(data=d)
data
time serial
0 2018-08-29T00:03:09 1
1 2018-08-29T12:03:09 2
2 2018-08-31T10:05:09 3
Define a function that returns the new column's value for each timestamp:
def func(t):
    if datetime.strptime(t, '%Y-%m-%dT%H:%M:%S').hour >= 3:
        return 1
    else:
        return 0
Now insert a new column, applying the function to get its values:
data.insert(2, 'new_column', data['time'].apply(func))
This is the updated dataframe
data
time serial new_column
0 2018-08-29T00:03:09 1 0
1 2018-08-29T12:03:09 2 1
2 2018-08-31T10:05:09 3 1
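As an alternative sketch (assuming the timestamps parse with pd.to_datetime), the per-row strptime call can be replaced by a single vectorized comparison on the .dt.hour accessor:

```python
import pandas as pd

data = pd.DataFrame({'time': ['2018-08-29T00:03:09', '2018-08-29T12:03:09',
                              '2018-08-31T10:05:09'],
                     'serial': [1, 2, 3]})

# Parse the whole column once, then compare the hour component in one step
data['new_column'] = (pd.to_datetime(data['time']).dt.hour >= 3).astype(int)
print(data)
```

This avoids calling a Python function once per row, which matters on large frames.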
I am trying to assign an ID to a pandas dataframe based on row count. For this I am trying to apply the logic below to the dataframe:
num = df.shape[0]
for i in range(num):
    print(math.ceil(i/4))
The idea is that every 4 consecutive rows get one ID, so the resultant dataframe would look like:
col_1 Group_ID
v_1 1
v_2 1
v_3 1
v_4 1
v_5 2
v_6 2
v_7 2
v_8 2
v_9 3
v_10 3
--- And so on.
Just a quick thought: how can I use the apply function on df.index?
Can I use the code below?
df['Index'] = df.index
df['GroupID'] = df['Index'].apply(np.ceil)
Any hints?
You can pass a function to apply, so create a named function and pass it
def everyFour(rowIdx):
    return math.ceil(rowIdx / 4)

df['GroupId'] = df['Index'].apply(everyFour)
or just use a lambda
df['GroupId'] = df['Index'].apply(lambda rowIdx: math.ceil(rowIdx / 4))
Note that this will leave the first row (index 0) with GroupId 0, so you might want to add 1 to rowIdx before dividing by 4.
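For a default RangeIndex this can also be done without apply at all, using integer division on the positional index (a sketch; it assumes the index is 0, 1, 2, …):

```python
import pandas as pd

df = pd.DataFrame({'col_1': [f'v_{i}' for i in range(1, 11)]})

# Integer-divide the row position by 4 and shift so IDs start at 1
df['Group_ID'] = df.index // 4 + 1
print(df)
```

This gives groups of exactly four rows: 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, …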
I have the following DataFrame with the columns low_scarcity and high_scarcity (a value is either on high or low scarcity):
id  low_scarcity       high_scarcity
0   When I was five..
1                      I worked a lot...
2                      I went to parties...
3   1 week ago
4   2 months ago
5                      another story..
I want to create another column 'target' that when there's an entry in low_scarcity column, the value will be 0, and when there's an entry in high_scarcity column, the value will be 1. Just like this:
id  low_scarcity       high_scarcity         target
0   When I was five..                        0
1                      I worked a lot...     1
2                      I went to parties...  1
3   1 week ago                               0
4   2 months ago                             0
5                      another story..       1
I tried first replacing the entries with no value with 0 and then creating a boolean condition, but I can't use .replace('', 0) because the empty cells don't appear as empty values.
Supposing your dataframe is called df and that a value is always on either high or low scarcity, the following line of code does it:
import numpy as np
df['target'] = 1*np.array(df['high_scarcity']!="")
in which the 1* performs an integer conversion of the boolean values.
If that is not the case, then a more complex approach should be taken (note the object dtype, so the assigned integers are not coerced to strings):
res = np.full(df.shape[0], "", dtype=object)  # fallback for rows with neither entry
res[df['high_scarcity'] != ""] = 1
res[df['low_scarcity'] != ""] = 0
df['target'] = res
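A sketch of one more option, using numpy.select, which makes the two conditions and the fallback explicit (assuming empty cells really are empty strings; the -1 default is a hypothetical sentinel for rows with neither entry):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'low_scarcity':  ['When I was five..', '', '', '1 week ago', '2 months ago', ''],
    'high_scarcity': ['', 'I worked a lot...', 'I went to parties...', '', '', 'another story..'],
})

# The first matching condition wins; default covers rows with neither entry
df['target'] = np.select(
    [df['high_scarcity'] != '', df['low_scarcity'] != ''],
    [1, 0],
    default=-1)
print(df)
```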
I am transitioning from Excel to Python and finding the process a little daunting. I have a pandas dataframe and cannot find how to count the total of each cluster of 1s per row, grouped by each ID (example data below).
ID 20-21 19-20 18-19 17-18 16-17 15-16 14-15 13-14 12-13 11-12
0 335344 0 0 1 1 1 0 0 0 0 0
1 358213 1 1 0 1 1 1 1 0 1 0
2 358249 0 0 0 0 0 0 0 0 0 0
3 365663 0 0 0 1 1 1 1 1 0 0
The result of the above, in the format
ID
    last column heading where a '1' occurs: count of '1's in that cluster
would be:
335344
16-17: 3
358213
19-20: 2
14-15: 4
12-13: 1
365663
13-14: 5
There are more than 11,000 rows of data I would like to output the result to a txt file. I have been unable to find any examples of how the same values are clustered by row, with a count for each cluster, but I am probably not using the correct python terminology. I would be grateful if someone could point me in the right direction. Thanks in advance.
First, use DataFrame.set_index with DataFrame.stack to reshape. Then create consecutive groups by comparing each value with its Series.shift-ed neighbour for inequality and taking the cumulative sum (Series.cumsum) into a new column g. Finally, filter the rows equal to 1 and aggregate by named aggregation with GroupBy.agg, using GroupBy.last and GroupBy.size:
df = df.set_index('ID').stack().reset_index(name='value')
df['g'] = df['value'].ne(df['value'].shift()).cumsum()
df1 = (df[df['value'].eq(1)].groupby(['ID', 'g'])
         .agg(a=('level_1', 'last'), b=('level_1', 'size'))
         .reset_index(level=1, drop=True)
         .reset_index())
print(df1)
ID a b
0 335344 16-17 3
1 358213 19-20 2
2 358213 14-15 4
3 358213 12-13 1
4 365663 13-14 5
Last, to write to a txt file use DataFrame.to_csv:
df1.to_csv('file.txt', index=False)
If you need your custom format in the text file, use:
with open("file.txt", "w") as f:
    for i, g in df1.groupby('ID'):
        f.write(f"{i}\n")
        for a, b in g[['a', 'b']].to_numpy():
            f.write(f"\t{a}: {b}\n")
You just need to use the sum method and then specify which axis you would like to sum on. To get the sum of each row, create a new series equal to the sum of the row.
# create new series equal to sum of values in the index row
df['sum'] = df.sum(axis=1) # specifies index (row) axis
The best method for getting the sum of each column is dependent on how you want to use that information but in general the core is just to use the sum method on the series and assign it to a variable.
# sum a column and assign result to variable
foo = df['20-21'].sum() # default axis=0
bar = df['16-17'].sum() # default axis=0
print(foo) # returns 1
print(bar) # returns 3
You can get the sum of each column using a for loop and add them to a dictionary. Here is a quick function I put together that should get the sum of each column and return a dictionary of the results, so you know which total belongs to which column. The two inputs are 1) the dataframe and 2) a list of any column names you would like to ignore:
def get_df_col_sum(frame: pd.DataFrame, ignore: list) -> dict:
    """Get the sum of each column in a dataframe as a dictionary"""
    # get list of headers in dataframe
    dfcols = frame.columns.tolist()
    # create a blank dictionary to store results
    dfsums = {}
    # loop through each column and add its sum to the dictionary
    for dfcol in dfcols:
        if dfcol not in ignore:
            dfsums.update({dfcol: frame[dfcol].sum()})
    return dfsums
I then ran the following code
# read excel to dataframe
df = pd.read_excel(test_file)
# ignore the ID column
ignore_list = ['ID']
# get sum for each column
res_dict = get_df_col_sum(df, ignore_list)
print(res_dict)
and got the following result.
{'20-21': 1, '19-20': 1, '18-19': 1, '17-18': 3, '16-17': 3, '15-16':
2, '14-15': 2, '13-14': 1, '12-13': 1, '11-12': 0}
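For comparison, the same per-column totals can be had without an explicit loop, e.g. by dropping the ignored columns and summing (a sketch assuming the same column layout, shown here on a trimmed-down frame):

```python
import pandas as pd

df = pd.DataFrame({'ID': [335344, 358213, 358249, 365663],
                   '20-21': [0, 1, 0, 0],
                   '19-20': [0, 1, 0, 0]})

# sum() gives one total per column; to_dict keys them by header
res_dict = df.drop(columns=['ID']).sum().to_dict()
print(res_dict)
```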
I have a dataframe with 3 columns now which appears like this
Model IsJapanese IsGerman
BenzC 0 1
BensGla 0 1
HondaAccord 1 0
HondaOdyssey 1 0
ToyotaCamry 1 0
I want to create a new dataframe and have TotalJapanese and TotalGerman as two columns in the same dataframe.
I am able to achieve this by creating 2 different dataframes. But wondering how to get both the counts in a single dataframe.
Please suggest, thank you!
Edit: adding another, similar dataframe to this (sorry, not sure whether it's allowed, but trying).
Second dataset: I am trying to save multiple counts in a single dataframe, based on repetition of the data.
Here is my sample dataset
Store Address IsLA IsGA
Albertsons Cross St 1 0
Safeway LeoSt 0 1
Albertsons Main St 0 1
RiteAid Culver St 1 0
My aim is to prepare a new dataset with multiple counts per store
The result should be like this
Store TotalStores TotalLA TotalGA
Alberstons 2 1 1
Safeway 1 0 1
RiteAid 1 1 0
Is it possible to achieve these in single dataframe ?
Thanks!
One way would be to store the sum of Japanese cars and German cars, and manually create a dataframe using them:
j, g = sum(df['IsJapanese']), sum(df['IsGerman'])
total_df = pd.DataFrame({'TotalJapanese': j,
                         'TotalGerman': g}, index=['Totals'])
print(total_df)
TotalJapanese TotalGerman
Totals 3 2
Another way would be to transpose (T) your dataframe, sum(axis=1), and transpose back:
total_df_v2 = pd.DataFrame(df.set_index('Model').T.sum(axis=1)).T
print(total_df_v2)
   IsJapanese  IsGerman
0           3         2
To answer your 2nd question, you can use DataFrameGroupBy.agg on your 'Store' column: aggregate with 'count' on Address and 'sum' on the other two columns. Then rename() your columns if needed:
resulting_df = (df.groupby('Store')
                  .agg({'Address': 'count',
                        'IsLA': 'sum',
                        'IsGA': 'sum'})
                  .rename({'Address': 'TotalStores',
                           'IsLA': 'TotalLA',
                           'IsGA': 'TotalGA'}, axis=1))
Prints:
TotalStores TotalLA TotalGA
Store
Albertsons 2 1 1
RiteAid 1 1 0
Safeway 1 0 1
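A third sketch for the first question: sum just the indicator columns and turn the resulting Series into a one-row frame (column names via a Series rename; 'Totals' is an arbitrary label):

```python
import pandas as pd

df = pd.DataFrame({'Model': ['BenzC', 'BensGla', 'HondaAccord',
                             'HondaOdyssey', 'ToyotaCamry'],
                   'IsJapanese': [0, 0, 1, 1, 1],
                   'IsGerman':   [1, 1, 0, 0, 0]})

# Sum the indicator columns, relabel, and transpose into a single row
totals = df[['IsJapanese', 'IsGerman']].sum().rename(
    {'IsJapanese': 'TotalJapanese', 'IsGerman': 'TotalGerman'})
total_df = totals.to_frame(name='Totals').T
print(total_df)
```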
I have a python pandas dataframe with a bunch of names and series, and I create a final column where I sum up the series. I want to get just the row names where the sum of the series equals 0, so I can later delete those rows. My dataframe is as follows (I create the last column just to sum up the series):
1 2 3 4 total
Ash 1 0 1 1 3
Bel 0 0 0 0 0
Cay 1 0 0 0 1
Jeg 0 1 1 1 3
Jut 1 1 1 1 4
Based on the last column, the series "Bel" is 0, so I want to be able to print out that name only, and then later I can delete that row or keep a record of these rows.
This is my code so far:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    for values in df['total']:
        if values == 0:
            print(df.index[values])
But this is obviously wrong because I am passing the value 0 as an index, which will always print the name of the first row. Not sure what method I can implement here though?
There are great solutions below and I also found a way using a simpler python skill, enumerate (because I still find list comprehension hard to write):
def check_empty(df):
    df['total'] = df.sum(axis=1)
    for name, values in enumerate(df['total']):
        if values == 0:
            print(df.index[name])
One possible way may be the following, where df is filtered using the value in total:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    index = df[df['total'] == 0].index.values.tolist()
    print(index)
If you would like to iterate through the rows, then using df.iterrows() may be another way as well:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    for index, row in df.iterrows():
        if row['total'] == 0:
            print(index)
Another option is np.where.
import numpy as np
df.iloc[np.where(df.loc[:, 'total'] == 0)]
Output:
1 2 3 4 total
Bel 0 0 0 0 0
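Since the stated goal is to later delete those rows, a boolean mask can both report the names and drop them in one pass (a sketch on the same sample data):

```python
import pandas as pd

df = pd.DataFrame({'1': [1, 0, 1, 0, 1],
                   '2': [0, 0, 0, 1, 1],
                   '3': [1, 0, 0, 1, 1],
                   '4': [1, 0, 0, 1, 1]},
                  index=['Ash', 'Bel', 'Cay', 'Jeg', 'Jut'])
df['total'] = df.sum(axis=1)

zero_rows = df.index[df['total'] == 0].tolist()  # names to record for later
df = df[df['total'] != 0]                        # keep only the non-zero rows
print(zero_rows)
```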