Assign operator method chaining str.join() - python

I have the following method-chaining code and want to create a new column, but I'm getting an error when doing the following.
(
    pd.pivot(test, index = ['file_path'], columns = 'year', values = 'file')
    .fillna(0)
    .astype(int)
    .reset_index()
    .assign(hierarchy = file_path.str[1:-1].str.join(' > '))
)
Before the assign method the dataframe looks something like this:
file_path 2017 2018 2019 2020
S:\Test\A 0 0 1 2
S:\Test\A\B 1 0 1 3
S:\Test\A\C 3 1 1 0
S:\Test\B\A 1 0 0 1
S:\Test\B\B 1 0 0 1
The error is: name 'file_path' is not defined.
file_path exists in the data frame but I'm not calling it correctly. What is the proper way to create a new column based on another using assign?

You can pass a callable to assign; it receives the dataframe as it exists at that point in the chain:
.assign(hierarchy=lambda fr: fr["file_path"].str[1:-1].str.join(" > "))
Here fr is the dataframe as modified so far (pivoted, index reset, etc.), so you can access its "file_path" column through it.
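A minimal, self-contained sketch of the callable form follows. The sample rows in test are invented for illustration, and splitting the path on the backslash before joining is an assumption about the intended hierarchy output (the original expression slices the string and then joins its individual characters).

```python
import pandas as pd

# Invented sample data; only the column names come from the question.
test = pd.DataFrame({
    "file_path": [r"S:\Test\A", r"S:\Test\A", r"S:\Test\A\B"],
    "year": [2019, 2020, 2020],
    "file": [1, 2, 3],
})

result = (
    pd.pivot(test, index="file_path", columns="year", values="file")
    .fillna(0)
    .astype(int)
    .reset_index()
    # The lambda receives the frame produced by the previous step,
    # so "file_path" is looked up on that frame, not as a bare name.
    # Assumption: split on the backslash, drop the drive, join the rest.
    .assign(hierarchy=lambda fr: fr["file_path"]
            .str.split("\\").str[1:].str.join(" > "))
)

print(result)
```

The bare name file_path fails because Python evaluates the argument before assign is ever called; wrapping the expression in a lambda defers that lookup until the intermediate dataframe exists.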

Related

Distinguish repeating column names by adding an integer using pandas

I have some columns that have the same names. I would like to add a 1 to the repeating column names.
Data
Date Type hi hello stat hi hello
1/1/2022 a 0 0 1 1 0
Desired
Date Type hi hello stat hi1 hello1
1/1/2022 a 0 0 1 1 0
Doing
mask = df['col2'].duplicated(keep=False)
I believe I can utilize mask, but I'm not sure how to achieve this efficiently without calling out the actual column. I would like to pass the full dataset and let the algorithm rename the duplicates.
Any suggestion is appreciated.
Use the built-in parser method _maybe_dedup_names():
df.columns = pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(df.columns)
# Date Type hi hello stat hi.1 hello.1
# 0 1/1/2022 a 0 0 1 1 0
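This is what pandas uses to deduplicate column headers from read_csv(). Because the method is private (note the leading underscore), its location and availability depend on the pandas version.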
Note that it scales to any number of duplicate names:
cols = ['hi'] * 3 + ['hello'] * 5
pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(cols)
# ['hi', 'hi.1', 'hi.2', 'hello', 'hello.1', 'hello.2', 'hello.3', 'hello.4']
In pandas < 1.3:
df.columns = pd.io.parsers.ParserBase({})._maybe_dedup_names(df.columns)
You need to apply the duplicated operation to the column names. And then map the duplication information to a string, which you can then add to the original column names.
df.columns = df.columns+[{False:'',True:'1'}[x] for x in df.columns.duplicated()]
We can do
s = df.columns.to_series().groupby(df.columns).cumcount().replace({0:''}).astype(str).radd('.')
df.columns = (df.columns + s).str.strip('.')
df
Out[153]:
Date Type hi hello stat hi.1 hello.1
0 1/1/2022 a 0 0 1 1 0
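If the exact names from the question (hi1, hello1, with no dot) are wanted, a small helper that relies only on public APIs is one option. This is a sketch: dedup_columns is a hypothetical name introduced here, and it assumes no existing column is already called hi1 or hello1.

```python
from collections import Counter

import pandas as pd


def dedup_columns(columns):
    """Append a running count to repeated names: hi, hi1, hi2, ..."""
    seen = Counter()
    renamed = []
    for name in columns:
        count = seen[name]
        # The first occurrence keeps its name; later ones get the count.
        renamed.append(name if count == 0 else f"{name}{count}")
        seen[name] += 1
    return renamed


df = pd.DataFrame(
    [["1/1/2022", "a", 0, 0, 1, 1, 0]],
    columns=["Date", "Type", "hi", "hello", "stat", "hi", "hello"],
)
df.columns = dedup_columns(df.columns)
print(list(df.columns))
```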

CSV: alternative to excel "IF" statement in python. Read column and create a new one with numpy.where or other function

I have a CSV file with several columns and I want to write a code that will read a specific column called 'ARPU average 6 month w/t roaming and discount' and then, create a new column called "Logical" which will be based on numpy.where(). Here is what I got at the moment:
csv_data = pd.read_csv("Results.csv")
data = csv_data[['ARPU average 6 month w/t roaming and discount']]
data = data.to_numpy()
sol = []
for target in data:
    if1 = np.where(data < 0, 1, 0)
    sol.append(if1)
csv_data["Logical"] = [sol].values
csv_data.to_csv('Results2.csv', index = False, header=True)
This loop is written incorrectly and does not work. It does not create a new column with the corresponding value for each row. To make it clear: if the value in the column is bigger than 0, it should record "1", otherwise "0". Any solution is fine (neither np.where() nor a loop is required).
To clarify what "Results.csv" is: it is a large data file, and I have highlighted the column we work with. The code needs to check whether the value in that column is bigger than 0 and write 1 or 0 to the new column (as described above).
updated answer
import pandas as pd
f1 = pd.read_csv("L1.csv")
f2 = pd.read_csv("L2.csv")
f3 = pd.merge(f1, f2, on ='CTN', how ='inner')
# f3.to_csv("Results.csv") # -> you do not need to save the file to a csv unless you really want to
# csv_data = pd.read_csv("Results.csv") # -> f3 is already saved in memory you do not need to read it again
# data = csv_data[['ARPU average 6 month w/t roaming and discount']] # -> you do not need this line
f3['Logical'] = (f3['ARPU average 6 month w/t roaming and discount']>0).astype(int)
f3.to_csv('Results2.csv', index = False, header=True)
original answer
Generally you do not need to use a loop when using pandas or numpy. Take this sample dataframe:
df = pd.DataFrame([0,1,2,3,0,0,0,1], columns=['data'])
You can simply use the boolean values returned (where column is greater than 0 return 1 else return 0) to create a new column.
df['new_col'] = (df['data'] > 0).astype(int)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
Or, if you want to use numpy:
df['new_col'] = np.where(df['data']>0, 1, 0)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
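Applied to the column from the question, the two forms can be compared side by side. The values below are invented for illustration; only the column name comes from the question.

```python
import numpy as np
import pandas as pd

col = "ARPU average 6 month w/t roaming and discount"

# Invented stand-in for the merged data.
csv_data = pd.DataFrame({col: [12.5, 0.0, -3.0, 7.1]})

# Boolean comparison cast to int: True -> 1, False -> 0.
csv_data["Logical"] = (csv_data[col] > 0).astype(int)

# Equivalent NumPy form, evaluated for the whole column at once.
via_where = np.where(csv_data[col] > 0, 1, 0)

print(csv_data)
print(via_where)
```

Both forms treat missing values (NaN) as not greater than 0, so such rows would get 0.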

Recover a standard, single-index data frame after using pandas groupby+apply

I want to apply a custom reduction function to each group in a Python dataframe. The function reduces the group to a single row by performing operations that combine several of the columns of the group.
I've implemented this like so:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={
    "afac": np.random.random(size=1000),
    "bfac": np.random.random(size=1000),
    "class": np.random.randint(low=0, high=5, size=1000)
})
def f(group):
    total_area = group['afac'].sum()
    per_area = (group['afac']/total_area).values
    per_pop = group['bfac'].values
    return pd.DataFrame(data={'per_apop': [np.sum(per_area*per_pop)]})
aggdf = df.groupby('class').apply(f)
My input data frame df looks like:
>>> df
afac bfac class
0 0.689969 0.992403 0
1 0.688756 0.728763 1
2 0.086045 0.499061 1
3 0.078453 0.198435 2
4 0.621589 0.812233 4
But my code gives this multi-indexed data frame:
>>> aggdf
per_apop
class
0 0 0.553292
1 0 0.503112
2 0 0.444281
3 0 0.517646
4 0 0.503290
I've tried various ways of getting back to a "normal" data frame, but none seem to work.
>>> aggdf.reset_index()
class level_1 per_apop
0 0 0 0.553292
1 1 0 0.503112
2 2 0 0.444281
3 3 0 0.517646
4 4 0 0.503290
>>> aggdf.unstack().reset_index()
class per_apop
0
0 0 0.553292
1 1 0.503112
2 2 0.444281
3 3 0.517646
4 4 0.503290
How can I perform this operation and get a normal data frame afterwards?
Update: The output data frame should have columns for class and per_apop. Ideally, the function f can return multiple columns and possibly multiple rows. Perhaps using
return pd.DataFrame(data={'per_apop': [np.sum(per_area*per_pop),2], 'sue':[1,3]})
You can select which level to reset as well as if you want to retain the index using reset_index. In your case, you ended up with a multi-index that has 2 levels: class and one that is unnamed. reset_index allows you to reset the entire index (default) or just the levels you want. In the following example, the last level (-1) is being pulled out of the index. By also using drop=True it is dropped rather than appended as a column in the data frame.
aggdf.reset_index(level=-1, drop=True)
per_apop
class
0 0.476184
1 0.476254
2 0.509735
3 0.502444
4 0.525287
EDIT:
To push the class level of the index back into the data frame, you can simply call .reset_index() again. Ugly, but it works.
aggdf.reset_index(level=-1, drop=True).reset_index()
class per_apop
0 0 0.515733
1 1 0.497349
2 2 0.527063
3 3 0.515476
4 4 0.494530
Alternatively, you could reset the whole index and then drop the extra column.
aggdf.reset_index().drop('level_1', axis=1)
class per_apop
0 0 0.515733
1 1 0.497349
2 2 0.527063
3 3 0.515476
4 4 0.494530
Make your user-defined function return a Series instead of a DataFrame:
def f(group):
    total_area = group['afac'].sum()
    per_area = (group['afac']/total_area).values
    per_pop = group['bfac'].values
    return pd.Series(data={'per_apop': np.sum(per_area*per_pop)})
df.groupby('class').apply(f).reset_index()
class per_apop
0 0 0.508332
1 1 0.505593
2 2 0.488117
3 3 0.481572
4 4 0.500401
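The Series-returning variant can be checked on a small deterministic frame. This is a sketch with invented numbers; selecting the two value columns before apply is an addition made here so the example does not depend on how a given pandas version treats the grouping column inside apply.

```python
import numpy as np
import pandas as pd

# Invented data, small enough to verify by hand.
df = pd.DataFrame({
    "afac": [1.0, 3.0, 2.0, 2.0],
    "bfac": [0.5, 1.0, 0.2, 0.4],
    "class": [0, 0, 1, 1],
})


def f(group):
    total_area = group["afac"].sum()
    per_area = (group["afac"] / total_area).values
    per_pop = group["bfac"].values
    # Returning a Series yields one row per group, indexed by "class" only.
    return pd.Series({"per_apop": np.sum(per_area * per_pop)})


aggdf = df.groupby("class")[["afac", "bfac"]].apply(f).reset_index()
print(aggdf)
```

Because each group contributes a single Series, the result has a plain one-level index, and reset_index() turns class into an ordinary column.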
Although you already have a good answer, a suggestion:
test the function you pass to df.groupby(...).apply( func ) on the first group, like this:
agroupby = df.groupby(...)
for key, groupdf in agroupby:  # an iterator -> (key, groupdf) ... pairs
    break  # get the first pair
print( "\n-- first groupdf: len %d type %s \n%s" % (
    len(groupdf), type(groupdf), groupdf ))  # DataFrame
test = myfunc( groupdf )
    # groupdf .col [col] [[col ...]] .set_index .resample ... as usual

How can I keep all columns in a dataframe, plus add groupby, and sum?

I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine. df1 = df[['task_id','duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now).
After (what I'm trying to achieve).
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The following keeps both columns in the result. Note that it counts the occurrences of each (task_id, duration) pair rather than summing duration:
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
import numpy as np
import pandas as pd
X = np.random.choice([0,1,2], 20)
Y = np.random.uniform(2,10,20)
df = pd.DataFrame({'task_id':X, 'duration':Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').agg('sum').reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y'])*100
df = df.drop('duration_y', axis=1) # Drops the summed duration; remove this line if you want to see it.
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
Disclaimer: all data is randomly generated in the setup; however, the calculations are straightforward and should be correct for any case.
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
list(df1)
# find each task_id as relative percentage of whole
df1['pct'] = df1['duration']/(df1['duration'].sum())
df1 = pd.DataFrame(df1)
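If every original row should be kept, with the group total and a within-group percentage alongside it, groupby(...).transform is another option. This is a sketch with invented numbers. The original attempt fails to line up because groupby(...).sum() returns one value per task_id, indexed by task_id rather than by the original row labels; taking an explicit .copy() of the two-column selection is what avoids the copy-of-a-slice warning.

```python
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame({
    "task_id": [0, 0, 1, 1, 1],
    "duration": [2.0, 6.0, 1.0, 1.0, 2.0],
    "other": list("abcde"),
})

# An explicit copy makes df1 an independent frame.
df1 = df[["task_id", "duration"]].copy()

# transform("sum") broadcasts each group's total back onto its rows.
df1["total"] = df1.groupby("task_id")["duration"].transform("sum")
df1["pct"] = df1["duration"] / df1["total"] * 100

print(df1)
```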

How to get pandas dataframe series name given a column value?

I have a python pandas dataframe with a bunch of names and series, and I create a final column where I sum up the series. I want to get just the row name where the sum of the series equals 0, so I can then later delete those rows. My dataframe is as follows (the last column I create just to sum up the series):
1 2 3 4 total
Ash 1 0 1 1 3
Bel 0 0 0 0 0
Cay 1 0 0 0 1
Jeg 0 1 1 1 3
Jut 1 1 1 1 4
Based on the last column, the series "Bel" is 0, so I want to be able to print out that name only, and then later I can delete that row or keep a record of these rows.
This is my code so far:
def check_empty(df):
    df['total'] = df.sum(axis=1) # create the 'total' column to find zeroes
    for values in df['total']:
        if values == 0:
            print(df.index[values])
But this obviously is wrong because I am passing the index of 0 to this loop, which will always print the name of the first row. Not sure what method I can implement here though?
There are great solutions below and I also found a way using a simpler python skill, enumerate (because I still find list comprehension hard to write):
def check_empty(df):
    df['total'] = df.sum(axis=1)
    for name, values in enumerate(df['total']):
        if values == 0:
            print(df.index[name])
One possible way is to filter df on the value in total and take the index of the matching rows:
def check_empty(df):
    df['total'] = df.sum(axis=1) # create the 'total' column to find zeroes
    index = df[df['total'] == 0].index.values.tolist()
    print(index)
If you would rather iterate through the rows, df.iterrows() is another way:
def check_empty(df):
    df['total'] = df.sum(axis=1) # create the 'total' column to find zeroes
    for index, row in df.iterrows():
        if row['total'] == 0:
            print(index)
Another option is np.where.
import numpy as np
df.iloc[np.where(df.loc[:, 'total'] == 0)]
Output:
1 2 3 4 total
Bel 0 0 0 0 0
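To get just the row names and then delete those rows, a boolean mask on the row sums is enough. This sketch rebuilds the frame from the question; the variable names are introduced here for illustration.

```python
import pandas as pd

# The frame from the question, without the helper "total" column.
df = pd.DataFrame(
    {1: [1, 0, 1, 0, 1], 2: [0, 0, 0, 1, 1],
     3: [1, 0, 0, 1, 1], 4: [1, 0, 0, 1, 1]},
    index=["Ash", "Bel", "Cay", "Jeg", "Jut"],
)

row_sums = df.sum(axis=1)

# Keep only the entries whose sum is 0 and read off their index labels.
empty_names = row_sums[row_sums == 0].index.tolist()
print(empty_names)

# Drop those rows by label; df itself is left unchanged.
cleaned = df.drop(index=empty_names)
print(cleaned)
```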
