How to convert text and numbers to dataframe in Python? - python

I have a text document with spending information. I want to use pandas and Python 3 to convert the text to a dataframe with two columns, without repeating row names by combining same names into one row with the respective amounts added to produce a single total.
Original "spending.txt":
shaving 150
shaving 200
coffee 100
food 350
transport 60
transport 40
desired output dataframe:
CATEGORY TOTAL
shaving 350
coffee 100
food 350
transport 100

This should do it:
import pandas as pd
df = pd.read_csv('spending.txt', header=None, sep=r'\s+')
df.columns = ['category', 'total']
df.groupby('category', as_index=False).sum()
category total
0 coffee 100
1 food 350
2 shaving 350
3 transport 100
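Note that groupby sorts the group keys by default, so the rows above come out alphabetically rather than in the order shown in the desired output. A minimal sketch, assuming the file contents from the question are inlined with StringIO (the variable names raw and totals are illustrative), that passes sort=False to keep first-appearance order:

```python
from io import StringIO

import pandas as pd

# Inline copy of spending.txt from the question, so the sketch is self-contained
raw = StringIO("""shaving 150
shaving 200
coffee 100
food 350
transport 60
transport 40
""")

# names= assigns the column labels while reading
df = pd.read_csv(raw, sep=r"\s+", header=None, names=["CATEGORY", "TOTAL"])

# sort=False keeps the groups in the order they first appear in the file
totals = df.groupby("CATEGORY", as_index=False, sort=False)["TOTAL"].sum()
print(totals)
```

If alphabetical order is fine, drop sort=False and the result matches the answer above.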

Reading in data
from io import StringIO
import pandas as pd
temp = StringIO("""
shaving 150
shaving 200
coffee 100
food 350
transport 60
transport 40
""")
df = pd.read_csv(temp, sep=r'\s+', engine='python', header=None)
df.groupby(0).sum().reset_index().rename({0:'category',1:'total'}, axis=1)
Output
category total
0 coffee 100
1 food 350
2 shaving 350
3 transport 100

Read file using read_csv
Then apply groupby
df = pd.read_csv('test.txt', sep=" ", header=None)
df.rename(columns={0:'category',1:'Total'},inplace=True)
final_df = df.groupby(['category'],as_index=False)['Total'].sum()
print(final_df)
category Total
0 coffee 100
1 food 350
2 shaving 350
3 transport 100

Related

How to sum multiple rows of data using pandas when the data is in Excel format

Hi everyone, how can I sum multiple rows of data using pandas when the data is in Excel format? I want to sum only the edwin and maria data. Please help me out, thanks in advance.
Excel data:
name salary incentive
0 john 2000 400
1 edwin 3000 600
2 maria 1000 200
Expected output:
name salary incentive
0 Total 5000 1000
1 john 2000 400
2 edwin 3000 600
3 maria 1000 200
Judging by the Total row in your expected output, you need the sums for 'john' and 'edwin', not for edwin and maria. I used isin, which returns a boolean mask (the ind variable) that selects the desired rows. A one-row dataframe labelled Total is built from the sums of those rows, and pd.concat then places it above the original rows. As for doing the sum in Excel, I do not understand what you want.
import pandas as pd
df = pd.DataFrame({'name':['john', 'edwin', 'maria'], 'salary':[2000, 3000, 1000], 'incentive':[400, 600, 200]})
ind = df['name'].isin(['john', 'edwin'])
df1 = pd.DataFrame({'name':['Total'], 'salary':[df.loc[ind, 'salary'].sum()], 'incentive':[df.loc[ind, 'incentive'].sum()]})
df1 = pd.concat([df1, df])
df1 = df1.reset_index(drop=True)
print(df1)
Output
name salary incentive
0 Total 5000 1000
1 john 2000 400
2 edwin 3000 600
3 maria 1000 200
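The same mask can select both numeric columns at once, which avoids writing one sum per column. A minimal sketch under the same assumptions as the answer (the names mask, totals and total_row are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "edwin", "maria"],
    "salary": [2000, 3000, 1000],
    "incentive": [400, 600, 200],
})

# Boolean mask: True for the rows whose name is in the list
mask = df["name"].isin(["john", "edwin"])

# Sum both numeric columns of the selected rows in one step (returns a Series)
totals = df.loc[mask, ["salary", "incentive"]].sum()

# Build the one-row "Total" frame and put it on top of the original rows
total_row = pd.DataFrame([{"name": "Total", **totals.to_dict()}])
result = pd.concat([total_row, df], ignore_index=True)
print(result)
```

Changing the list passed to isin (for example to edwin and maria) changes which rows feed the Total row.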

check specific values in a dataframe and make the sum

I want to compute the sum for each 'Group' that has at least one 'Customer' with an 'Active' Bail.
Sample Input :
Customer ID Group Bail Amount
0 23453 NAFNAF Active 200
1 23849 LINDT Active 350
2 23847 NAFNAF Inactive 100
3 84759 CARROUF Inactive 20
For example 'NAFNAF' has 2 customers, including one with an active bail.
Output expected :
NAFNAF : 300
LINDT : 350
TOTAL ACTIVE: 650
I don't wanna change the original dataframe
You can use:
(df.assign(Bail=df.Bail.eq('Active'))
.groupby('Group')[['Bail', 'Amount']].agg('sum')
.loc[lambda d: d['Bail'].ge(1), ['Amount']]
)
output:
Amount
Group
LINDT 350
NAFNAF 300
Full output with total:
df2 = (
df.assign(Bail=df.Bail.eq('Active'))
.groupby('Group')[['Bail', 'Amount']].agg('sum')
.loc[lambda d: d['Bail'].ge(1), ['Amount']]
)
df2 = pd.concat([df2, df2.sum().to_frame('TOTAL').T])
output:
Amount
LINDT 350
NAFNAF 300
TOTAL 650
Create a boolean mask of the groups with at least one active bail:
m = df['Group'].isin(df.loc[df['Bail'].eq('Active'), 'Group'])
out = df[m]
At this point, your filtered dataframe looks like:
>>> out
Customer ID Group Bail Amount
0 23453 NAFNAF Active 200
1 23849 LINDT Active 350
2 23847 NAFNAF Inactive 100
Now you can use groupby and sum:
out = df[m].groupby('Group')['Amount'].sum()
out = pd.concat([out, pd.Series(out.sum(), index=['TOTAL ACTIVE'])])
# Output
LINDT 350
NAFNAF 300
TOTAL ACTIVE 650
dtype: int64
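Another way to build the same mask is a grouped transform, which marks every row of a group as soon as one of its rows is active. A minimal sketch, assuming the sample input from the question rebuilt by hand (the names has_active, sums and total_active are illustrative); the original dataframe is left unchanged:

```python
import pandas as pd

df = pd.DataFrame({
    "Customer ID": [23453, 23849, 23847, 84759],
    "Group": ["NAFNAF", "LINDT", "NAFNAF", "CARROUF"],
    "Bail": ["Active", "Active", "Inactive", "Inactive"],
    "Amount": [200, 350, 100, 20],
})

# True for every row whose Group contains at least one 'Active' bail
has_active = df["Bail"].eq("Active").groupby(df["Group"]).transform("any")

# Sum Amount per qualifying group, then add them up for the overall total
sums = df.loc[has_active].groupby("Group")["Amount"].sum()
total_active = int(sums.sum())

for group, amount in sums.items():
    print(f"{group} : {amount}")
print(f"TOTAL ACTIVE: {total_active}")
```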

Add values in columns if criteria from another column is met

I have the following DataFrame
import pandas as pd
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500]}
df = pd.DataFrame(data=d)
Client  Salesperson  Amount  Salesperson 2  Amount2
1       John         1000    Bob            400
2       John         1000    Richard        200
3       Bob          0       John           300
4       Richard      500     Tom            500
And I just need to create some sort of "sumif" statement (the one from excel) that will add the amount each salesperson is due. I don't know how to iterate over each row, but I want to have it so that it adds the values in "Amount" and "Amount2" for each one of the salespersons.
Then I need to be able to see the amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person  Total Amount
John          2300
Bob           400
Richard       700
Tom           500
There are multiple ways to solve this. One option is to use pd.concat to stack the required column pairs on top of each other and then use groupby:
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
you get
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500], 'Salesperson 3':['Nick','Richard','Sam','Bob'],
'Amount3':[400,800,100,400]}
df = pd.DataFrame(data=d)
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'}), df[['Salesperson 3', 'Amount3']].rename(columns={'Salesperson 3':'Salesperson','Amount3':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
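With more pairs the concat call gets long, so the pieces can be built in a loop instead. A minimal sketch, assuming the three-pair dataframe above; the pairs list and the names parts and totals are illustrative, not part of the original answer:

```python
import pandas as pd

d = {'Client': [1, 2, 3, 4],
     'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500],
     'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500],
     'Salesperson 3': ['Nick', 'Richard', 'Sam', 'Bob'],
     'Amount3': [400, 800, 100, 400]}
df = pd.DataFrame(data=d)

# Each tuple names one (salesperson column, amount column) pair
pairs = [('Salesperson', 'Amount'),
         ('Salesperson 2', 'Amount2'),
         ('Salesperson 3', 'Amount3')]

# Give every pair the same two column names so the pieces stack cleanly
parts = [
    df[[name_col, amount_col]].rename(
        columns={name_col: 'Salesperson', amount_col: 'Amount'})
    for name_col, amount_col in pairs
]

totals = (pd.concat(parts, ignore_index=True)
          .groupby('Salesperson', as_index=False)['Amount'].sum())
print(totals)
```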
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson':'Salesperson 1','Amount':'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson','Amount'], i='Client',j='num', suffix=r'\s?\d+').reset_index(drop=True)
The above will reshape df,
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you required output
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the Salespersons are in one column, and the amounts are in another, then you can groupby and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index="Client",
names_to=".value",
names_pattern=r"([a-zA-Z]+).*",
)
.groupby("Salesperson", as_index = False)
.Amount
.sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value placeholder tells the function that the part of each column name matched by the regex group should be kept as the header of the output columns. The column names follow a pattern: they start with text (either Salesperson or Amount) and may or may not end with a number. names_pattern captures this pattern; .value is paired with the group inside the parentheses, and whatever falls outside the group does not matter in this case.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.

Update Specific Pandas Rows with Value from Different Dataframe

I have a pandas dataframe that contains budget data but my sales data is located in another dataframe that is not the same size. How can I get my sales data updated in my budget data? How can I write conditions so that it makes these updates?
DF budget:
cust type loc rev sales spend
0 abc new north 500 0 250
1 def new south 700 0 150
2 hij old south 700 0 150
DF sales:
cust type loc sales
0 abc new north 15
1 hij old south 18
DF budget outcome:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
Any thoughts?
Assuming that the 'cust' column is unique in your other df, you can set the index of the sales df to 'cust' and call map on the budget df (df below is the budget dataframe and df1 is the sales dataframe). This maps each 'cust' in the budget df to its sales value. You will get NaN where there is no matching customer, so call fillna(0) to fill those values:
In [76]:
df['sales'] = df['cust'].map(df1.set_index('cust')['sales']).fillna(0)
df
Out[76]:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
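If 'cust' alone is not unique, the map approach will not work as-is. One alternative, sketched here under the assumption that cust, type and loc together identify a row (the names budget, sales, keys and merged are mine, not from the question), is a left merge on all three keys:

```python
import pandas as pd

budget = pd.DataFrame({
    'cust': ['abc', 'def', 'hij'],
    'type': ['new', 'new', 'old'],
    'loc': ['north', 'south', 'south'],
    'rev': [500, 700, 700],
    'sales': [0, 0, 0],
    'spend': [250, 150, 150],
})
sales = pd.DataFrame({
    'cust': ['abc', 'hij'],
    'type': ['new', 'old'],
    'loc': ['north', 'south'],
    'sales': [15, 18],
})

keys = ['cust', 'type', 'loc']

# Drop the placeholder sales column, then pull in the real one via a left merge;
# budget rows without a match in sales get NaN
merged = budget.drop(columns='sales').merge(sales, on=keys, how='left')

# Unmatched rows become 0, as in the fillna(0) step above
merged['sales'] = merged['sales'].fillna(0).astype(int)

# Restore the original column order
merged = merged[budget.columns]
print(merged)
```

Like the map version, this overwrites every sales value in the budget data, so rows without a match end up as 0 rather than keeping a previous value.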

Pandas Pivot tables row subtotals

I'm using Pandas 0.10.1
Considering this Dataframe:
Date State City SalesToday SalesMTD SalesYTD
20130320 stA ctA 20 400 1000
20130320 stA ctB 30 500 1100
20130320 stB ctC 10 500 900
20130320 stB ctD 40 200 1300
20130320 stC ctF 30 300 800
How can I get subtotals per state?
State City SalesToday SalesMTD SalesYTD
stA ALL 50 900 2100
stA ctA 20 400 1000
stA ctB 30 500 1100
I tried with a pivot table, but I can only get subtotals in columns:
table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State','City'], aggfunc=np.sum, margins=True)
I can achieve this on excel, with a pivot table.
If you put State and City not both in the rows, you'll get separate margins. Reshape and you get the table you're after:
In [10]: table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State'], cols=['City'], aggfunc=np.sum, margins=True)
In [11]: table.stack('City')
Out[11]:
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
All All 1900 130 5100
ctA 400 20 1000
ctB 500 30 1100
ctC 500 10 900
ctD 200 40 1300
ctF 300 30 800
I admit this isn't totally obvious.
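The rows= and cols= keywords above belong to the old pandas version mentioned in the question; in current pandas the same arguments are spelled index= and columns=. A hedged sketch of the same idea, with the sample data rebuilt by hand (the Date column is omitted, and the name stacked is illustrative). The dropna call is an addition of mine, on the assumption that some pandas versions keep the empty State/City combinations after stack:

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['stA', 'stA', 'stB', 'stB', 'stC'],
    'City': ['ctA', 'ctB', 'ctC', 'ctD', 'ctF'],
    'SalesToday': [20, 30, 10, 40, 30],
    'SalesMTD': [400, 500, 500, 200, 300],
    'SalesYTD': [1000, 1100, 900, 1300, 800],
})

# State in the rows, City in the columns: margins=True then adds an 'All'
# column per value (the per-state subtotal) and an 'All' row (the grand total)
table = pd.pivot_table(
    df,
    values=['SalesToday', 'SalesMTD', 'SalesYTD'],
    index=['State'],
    columns=['City'],
    aggfunc='sum',
    margins=True,
)

# Move City back into the row index and drop combinations with no data
stacked = table.stack('City').dropna(how='all')
print(stacked)
```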
You can get the summarized values by using groupby() on the State column.
Let's make some sample data first:
import pandas as pd
from io import StringIO
incsv = StringIO("""Date,State,City,SalesToday,SalesMTD,SalesYTD
20130320,stA,ctA,20,400,1000
20130320,stA,ctB,30,500,1100
20130320,stB,ctC,10,500,900
20130320,stB,ctD,40,200,1300
20130320,stC,ctF,30,300,800""")
df = pd.read_csv(incsv, index_col=['Date'], parse_dates=True)
Then apply the groupby function and add a column City:
dfsum = df.groupby('State', as_index=False).sum(numeric_only=True)
dfsum['City'] = 'All'
print(dfsum)
State SalesToday SalesMTD SalesYTD City
0 stA 50 900 2100 All
1 stB 50 700 2200 All
2 stC 30 300 800 All
We can append the original data to the summed df by using pd.concat (sort=True orders the columns alphabetically):
dfsum = pd.concat([dfsum, df], sort=True).set_index(['State','City']).sort_index()
print(dfsum)
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
I added the set_index and sort_index to make it look more like your example output; it's not strictly necessary to get the results.
I think this subtotal example code is what you want (similar to excel subtotal).
I assume that you want to group by columns A, B, C and D, then count the values in column E.
main_df.groupby(['A', 'B', 'C']).apply(lambda sub_df:
sub_df.pivot_table(index=['D'], values=['E'], aggfunc='count', margins=True))
output:
E
A B C D
a a a a 1
b 2
c 2
all 5
b b a a 3
b 2
c 2
all 7
b b b a 3
b 6
c 2
d 3
all 14
How about this one ?
table = pd.pivot_table(data, index=['State'],columns = ['City'],values=['SalesToday', 'SalesMTD','SalesYTD'],\
aggfunc=np.sum, margins=True)
If you are interested, I have just created a small function to make this easier, since you might want to apply a 'subtotal' to many tables. It works for tables created via both pivot_table() and groupby(). An example table to use it on is provided on this Stack Overflow page: Sub Total in pandas pivot Table
def get_subtotal(table, sub_total='subtotal', get_total=False, total='TOTAL'):
    """
    Parameters
    ----------
    table : dataframe, table with multi-index resulting from pd.pivot_table() or
        df.groupby().
    sub_total : str, optional
        Name given to the subtotal. The default is 'subtotal'.
    get_total : boolean, optional
        Specify whether to add the final total (in case you used groupby()).
        The default is False.
    total : str, optional
        Name given to the total. The default is 'TOTAL'.
    Returns
    -------
    A table with the total and subtotal added.
    """
    index_name1 = table.index.names[0]
    index_name2 = table.index.names[1]
    pvt = table.unstack(0)
    mask = pvt.columns.get_level_values(index_name1) != 'All'
    #print (mask)
    pvt.loc[sub_total] = pvt.loc[:, mask].sum()
    pvt = pvt.stack().swaplevel(0,1).sort_index()
    pvt = pvt[pvt.columns[1:].tolist() + pvt.columns[:1].tolist()]
    if get_total:
        mask = pvt.index.get_level_values(index_name2) != sub_total
        pvt.loc[(total, '' ),: ] = pvt.loc[mask].sum()
    print(pvt)
    return pvt
# aggfunc can be 'sum', 'count', etc.
table = pd.pivot_table(df, index=['A'], values=['B', 'C'], columns=['D', 'E'], fill_value=0, aggfunc='sum', margins=True, margins_name='Total')
print(table)
