i want to use the outputs as data and sum them - python

import numpy as np
import pandas as pd
df = pd.read_csv('test_python.csv')
print(df.groupby('fifth').sum())
this is my data
**And I am summing the first three columns for every word is in fifth.
The result is this and it is correct
The next thing I want to do is take those results and sum the together
example:
**buy = 6
cheese = 8
file = 12
.
.
.
word = 13**
How can I do this? how can I use the results.**
-And also now, want to use the column second as a new column with the name second2 with the results as data, how can I do it?

For Summing you can use apply-lambda ;
df = pd.DataFrame({"first":[1]*14,
"second":np.arange(1,15),
"third":[0]*14,
"forth":["one","two","three","four"]*3+["one","two"],
"fifth":["hello","no","hello","hi","buy","hello","cheese","water","hi","juice","file","word","hi","red"]})
df1 = df.groupby(['fifth'])['first','second','third'].agg('sum').reset_index()
df1["sum_3_Col"] = df1.apply(lambda x: x["first"] + x["second"] + x["third"],axis=1)
df1.rename(columns={'second':'second2'}, inplace=True)
Output of df1;

Related

Combining Successive Pandas Dataframes in One Master Dataframe via a Loop

I'm trying to loop through a series of tickers cleaning the associated dataframes then combining the individual ticker dataframes into one large dataframe with columns named for each ticker. The following code enables me to loop through unique tickers and name the columns of each ticker's dataframe after the specific ticker:
import pandas as pd
def clean_func(tkr,f1):
f1['Date'] = pd.to_datetime(f1['Date'])
f1.index = f1['Date']
keep = ['Col1','Col2']
f2 = f1[keep]
f2.columns = [tkr+'Col1',tkr+'Col2']
return f2
tkrs = ['tkr1','tkr2','tkr3']
for tkr in tkrs:
df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
df2 = clean_func(tkr,df1)
However, I don't know how to create a master dataframe where I add each new ticker to the master dataframe. With that in mind, I'd like to align each new ticker's data using the datetime index. So, if tkr1 has data for 6/25/22, 6/26/22, 6/27/22, and tkr2 has data for 6/26/22, and 6/27/22, the combined dataframe would show all three dates but would produce a NaN for ticker 2 on 6/25/22 since there is no data for that ticker on that date.
When not in a loop looking to append each successive ticker to a larger dataframe (as per above), the following code does what I'd like. But it doesn't work when looping and adding new ticker data for each successive loop (or I don't know how to make it work in the confines of a loop).
combined = pd.concat((df1, df2, df3,...,dfn), axis=1)
Many thanks in advance.
You should only create the master DataFrame after the loop. Appending to the master DataFrame in each iteration via pandas.concat is slow since you are creating a new DataFrame every time.
Instead, read each ticker DataFrame, clean it, and append it to a list which store every ticker DataFrames. After the loop create the master DataFrame with all the Dataframes using pandas.concat:
import pandas as pd
def clean_func(tkr,f1):
f1['Date'] = pd.to_datetime(f1['Date'])
f1.index = f1['Date']
keep = ['Col1','Col2']
f2 = f1[keep]
f2.columns = [tkr+'Col1',tkr+'Col2']
return f2
tkrs = ['tkr1','tkr2','tkr3']
dfs_list = []
for tkr in tkrs:
df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
df2 = clean_func(tkr,df1)
dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
As a suggestion here is a cleaner way of defining your clean_func using DataFrame.set_index and DataFrame.add_prefix.
def clean_func(tkr, f1):
f1['Date'] = pd.to_datetime(f1['Date'])
f2 = f1.set_index('Date')[['Col1','Col2']].add_prefix(tkr)
return f2
Or if you want, you can parse the Date column as datetime and set it as index directly in the pd.read_csv call by specifying index_col and parse_dates parameters (honestly, I'm not sure if those two parameters will play well together, and I'm too lazy to test it, but you can try ;)).
import pandas as pd
def clean_func(tkr,f1):
f2 = f1[['Col1','Col2']].add_prefix(tkr)
return f2
tkrs = ['tkr1','tkr2','tkr3']
dfs_list = []
for tkr in tkrs:
df1 = pd.read_csv(f'C:\\path\\{tkr}.csv', index_col='Date', parse_dates=['Date'])
df2 = clean_func(tkr,df1)
dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
Before the loop create an empty df with:
combined = pd.DataFrame()
Then within the loop (after loading df1 - see code above):
combined = pd.concat((combined, clean_func(tkr, df1)), axis=1)
If you get:
TypeError: concat() got multiple values for argument 'axis'
Make sure your parentheses are correct per above.
With the code above, you can skip the original step:
df2 = clean_func(tkr,df1)
Since it is embedded in the concat function. Alternatively, you could keep the df2 step and use:
combined = pd.concat((combined,df2), axis=1)
Just make sure the dataframes are encapsulated by parentheses within the concat function.
Same answer as GC123 but here is a full example which mimics reading from separate files and concatenating them
import pandas as pd
import io
fake_file_1 = io.StringIO("""
fruit,store,quantity,unit_price
apple,fancy-grocers,2,9.25
pear,fancy-grocers,3,100
banana,fancy-grocers,1,256
""")
fake_file_2 = io.StringIO("""
fruit,store,quantity,unit_price
banana,bargain-grocers,667,0.01
apple,bargain-grocers,170,0.15
pear,bargain-grocers,281,0.45
""")
fake_files = [fake_file_1,fake_file_2]
combined = pd.DataFrame()
for fake_file in fake_files:
df = pd.read_csv(fake_file)
df = df.set_index('fruit')
combined = pd.concat((combined, df), axis=1)
print(combined)
Output
This method is slightly more efficient:
combined = []
for fake_file in fake_files:
combined.append(pd.read_csv(fake_file).set_index('fruit'))
combined = pd.concat(combined, axis=1)
print(combined)
Output:
store quantity unit_price store quantity unit_price
fruit
apple fancy-grocers 2 9.25 bargain-grocers 170 0.15
pear fancy-grocers 3 100.00 bargain-grocers 281 0.45
banana fancy-grocers 1 256.00 bargain-grocers 667 0.01

Adding empty rows in Pandas dataframe

I'd like to append consistently empty rows in my dataframe.
I have following code what does what I want but I'm struggling in adjusting it to my needs:
s = pd.Series('', data_only_trades.columns)
f = lambda d: d.append(s, ignore_index=True)
set_rows = np.arange(len(data_only_trades)) // 4
empty_rows = data_only_trades.groupby(set_rows, group_keys=False).apply(f).reset_index(drop=True)
How can I adjust the code so I add two or more rows instead of one?
How can I set a starting point (e.g. it should start with row 5 -- Do I have to use .loc then in arange?)
Also tried this code but I was struggling in setting the starting row and the values to blank (I got NaN):
df_new = pd.DataFrame()
for i, row in data_only_trades.iterrows():
df_new = df_new.append(row)
for _ in range(2):
df_new = df_new.append(pd.Series(), ignore_index=True)
Thank you!
import numpy as np
v = np.ndarray(shape=(numberOfRowsYouWant,df.values.shape[1]), dtype=object)
v[:] = ""
pd.DataFrame(np.vstack((df.values, v)))
I think you can use NumPy
but, if you want to use your manner, simply convert NaN to "":
df.fillna("")

Extracting only the percent value in a column in pandas

I have a column that includes strings including a percent at the end e.g XX: (+2, 30%); (-5, 20%); (+17, 50%) .
I need to extract the highest % value for each such string and perform this on the whole column.
Any advice will be highly appreciated!
Thanks
In my understanding, each cell in column XX is a cells which contains some percentages. I have included a small test DataFrame I have used:
import pandas as pd
import re
df = pd.DataFrame({"XX":["(+2, 30%), (-5, 20%), (+17, 50%)","(+2, 70%), (-5, 20%), (+17, 50%)", ""]})
pattern = re.compile("([0-9\.]+)%")
df["XX"].apply(lambda x: max(pattern.findall(x), default=-1))
OUTPUT
0 50
1 70
this code returns the most value in a column having percents
import pandas as pd
import numpy as np
data = [['2.3%', 1],['5.3%', 3]]
data = pd.DataFrame(data)
first_column = data.iloc[:, 0]
percent_list = []
for val in first_column:
percent_list.append(float(val[:-1]))
print(percent_list[np.argmax(percent_list)])

Count occurrence of column values in other dataframe column

I have two dataframes and I want to count the occurrence of "classifier" in "fullname". My problem is that my script counts a word like "carrepair" only for one classifier and I would like to have a count for both classifiers. I would also like to add one random coordinate that matches the classifier.
First dataframe:
Second dataframe:
Result so far:
Desired Result:
My script so far:
import pandas as pd
fl = pd.read_excel (r'fullname.xlsx')
clas= pd.read_excel (r'classifier.xlsx')
fl.fullname= fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()
pat = '({})'.format('|'.join(clas['classifier'].unique()))
fl['fullname'] = fl['fullname'].str.extract(pat, expand = False)
clas['count_of_classifier'] = clas['classifier'].map(fl['fullname'].value_counts())
print(clas)
Thanks!
You could try this:
import pandas as pd
fl = pd.read_excel (r'fullname.xlsx')
clas= pd.read_excel (r'classifier.xlsx')
fl.fullname= fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()
# Add a new column to 'fl' containing either 'repair' or 'car'
for value in clas["classifier"].values:
fl.loc[fl["fullname"].str.contains(value, case=False), value] = value
# Count values and create a new dataframe
new_clas = pd.DataFrame(
{
"classifier": [col for col in clas["classifier"].values],
"count": [fl[col].count() for col in clas["classifier"].values],
}
)
# Merge 'fl' and 'new_clas'
new_clas = pd.merge(
left=new_clas, right=fl, how="left", left_on="classifier", right_on="fullname"
).reset_index(drop=True)
# Keep only expected columns
new_clas = new_clas.reindex(columns=["classifier", "count", "coordinate"])
print(new_clas)
# Outputs
classifier count coordinate
repair 3 52.520008, 13.404954
car 3 54.520008, 15.404954

Splitting text and numbers in dataframe in python

I have a dataframe df with column name 'col' as the second column and the data looks like:
Dataframe
Want to separate text part in one column with name "Casing Size" and numerical part with "DepthTo" in other column.
Desired Output
import pandas as pd
import io
from google.colab import files
uploaded = files.upload()
df = pd.read_excel(io.BytesIO(uploaded['Test-Checking.xlsx']))
#Method 1
df2 = pd.DataFrame(data=df, columns=['col'])
df2 = df2.col.str.extract('([a-zA-Z]+)([^a-zA-Z]+)', expand=True)
df2.columns = ['CasingSize', 'DepthTo']
df2
#Method 2
def split_col(x):
try:
numb = float(x.split()[0])
txt = x.split()[1]
except:
numb = float(x.split()[1])
txt = x.split()[0]
x['col1'] = txt
x['col2'] = numb
df2['col1'] = df.col.apply(split_col)
df2
Tried two methods but none of them work correctly. Is there anyone help me?
Code in Google Colab
Excel File Attached
Try this
first you need to return the the values from your functions. then you can unpack them into your columns using the to_list()
def sample(x):
b,y=x.split()
return b,y
temp_df=df2['col'].apply(sample)
df2[['col1','col2']]=pd.DataFrame(temp_df.tolist())
You could try splitting the values into a list, then sorting them so that the numerical part comes first. Then you could apply pd.Series and assign back to the two columns.
import pandas as pd
df = pd.DataFrame({'col':["PWT 69.2", '283.5 HWT', '62.9 PWT', '284 HWT']})
df[['Casing Size','DepthTO']] = df['col'].str.split().apply(lambda x: sorted(x)).apply(pd.Series)
print(df)
Output
col Casing Size DepthTO
0 PWT 69.2 69.2 PWT
1 283.5 HWT 283.5 HWT
2 62.9 PWT 62.9 PWT
3 284 HWT 284 HWT

Categories