I have an issue finding negative results in pandas [duplicate] - python

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 2 years ago.
I have data called SalesData that contains "Profit", "Sales" and "Sub-Category" columns,
and when I use this code
SubCategoryProfit = SalesData[["Sub-Category", "Profit"]].groupby(by="Sub-Category").sum().sort_values(by="Profit")
# Style negative values red (color_negative_red is defined earlier in my notebook)
SalesData.style.applymap(color_negative_red, subset=['Profit', 'Sales'])
# Print the results
print(SubCategoryProfit)
I get these results:
Profit
Sub-Category
Tables -17725.4811
Bookcases -3472.5560
Supplies -1189.0995
Fasteners 949.5182
Machines 3384.7569
Labels 5546.2540
Art 6527.7870
Envelopes 6964.1767
However, when I look for only the negative results with this code
JustSubCatProf = SalesData[["Sub-Category", "Profit"]]
NegProfFilter = SalesData["Profit"] < 0.0
JustNegSubCatProf = JustSubCatProf[NegProfFilter].groupby(by = "Sub-Category").sum().sort_values(by = "Profit")
print(JustNegSubCatProf)
I get this instead!
Profit
Sub-Category
Binders -38510.4964
Tables -32412.1483
Machines -30118.6682
Bookcases -12152.2060
Chairs -9880.8413
Appliances -8629.6412
Phones -7530.6235
Furnishings -6490.9134
Storage -6426.3038
Supplies -3015.6219
There should be only 3 negative results, so I'm not sure what I am doing wrong.
Can someone help me please?

You can create a new dataframe containing only the negative values by filtering as follows.
SalesData[SalesData["valuecol1"] < 0]
I cannot fully understand your problem as I cannot see exactly what the data contains. If you can share some of the data, I can give a clearer answer.

Once you have this result:
In [2878]: df
Out[2878]:
Profit
Sub-Category
Tables -17725.4811
Bookcases -3472.5560
Supplies -1189.0995
Fasteners 949.5182
Machines 3384.7569
Labels 5546.2540
Art 6527.7870
Envelopes 6964.1767
You can do this to get only the negative rows:
In [2880]: df[df.Profit.lt(0)]
Out[2880]:
Profit
Sub-Category
Tables -17725.4811
Bookcases -3472.5560
Supplies -1189.0995
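To connect this back to the question: the second snippet filters individual transactions before grouping, so every sub-category that contains any negative sale appears, and its positive rows are discarded before the sum. Filtering after aggregating gives the expected three rows. A minimal sketch, reusing the SalesData frame from the question:
SubCategoryProfit = (
    SalesData[["Sub-Category", "Profit"]]
    .groupby("Sub-Category")
    .sum()
    .sort_values("Profit")
)
# Keep only sub-categories whose total profit is negative
JustNegSubCatProf = SubCategoryProfit[SubCategoryProfit["Profit"] < 0]
print(JustNegSubCatProf)  # Tables, Bookcases, Supplies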

Related

How to split csv data

I have a problem with csv data that looks like this:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth; Supermarket Product 0-1 years
36-50 Social Media; Word of mouth 1-2 years
18-24 Advertisement +4 years
and I am trying to convert the file into this format, either through a Jupyter notebook or from Excel:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth 0-1 years
18-24 Supermarket Product 0-1 years
36-50 Social Media 1-2 years
36-50 Word of mouth 1-2 years
18-24 Advertisement +4 years
Let's say the csv file is Untitled form.csv and I import the data into a Jupyter notebook:
data = pd.read_csv('Untitled form.csv')
Can anyone tell me how I should do it?
I have tried doing it in Excel using the Data tab's column-splitting tool, but of course that only separates the data into columns, while what I want is the data separated into rows while still retaining the values from the other columns.
Anyway... I found another way to do it, which is more roundabout. First I edit the file through PowerSource Excel and save it to a different file... and then, if a utf-8 encoding error appears, I just add encoding='cp1252'.
So it would become like this:
import pandas as pd
data_split = pd.read_csv('Untitled form split.csv',
                         skipinitialspace=True,
                         usecols=range(1, 7),
                         encoding='cp1252')
However if there's a more efficient way, please let me know. Thanks
I'm not 100% sure about your question since I think it might be two separate issues but hopefully this should fix it.
import pandas as pd
data = pd.read_fwf('Untitled form.csv')
cols = data.columns
data_long = pd.DataFrame(columns=data.columns)
for idx, row in data.iterrows():
    # Split the multi-answer column on ';' and strip stray spaces
    hear_from = row['Where do you hear our company from?'].split(';')
    hear_from_fmt = list(map(lambda x: x.strip(), hear_from))
    n_items = len(hear_from_fmt)
    # Repeat the other columns once per split value
    d = {
        cols[0]: [row[0]] * n_items,
        cols[1]: hear_from_fmt,
        cols[2]: [row[2]] * n_items,
    }
    data_long = pd.concat([data_long, pd.DataFrame(d)], ignore_index=True)
Let's break it down.
The line data = pd.read_fwf('Untitled form.csv') reads the file while inferring the spacing between columns. This is only useful because I am not sure your file is a proper CSV; if it is, you can open it normally with pd.read_csv, and if not, this might help.
Now for the rest. We iterate through each row and select the ways someone could have heard of your company. These are split on ; and then stripped to ensure there are no stray spaces. A new temporary dataframe is created where the first and last columns repeat the row's original values and there are as many rows as there are elements in the hear_from_fmt list. The dataframes are then concatenated together.
Now there might be a more efficient solution, but this should work.
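On the efficiency point, a shorter route is to split the column into lists and let DataFrame.explode expand each list element onto its own row. A minimal sketch, assuming the file parses cleanly with pd.read_csv:
import pandas as pd

data = pd.read_csv('Untitled form.csv')
col = 'Where do you hear our company from?'
# Turn 'Word of mouth; Supermarket Product' into a list of answers
data[col] = data[col].str.split(';')
# One row per answer, other columns repeated automatically
data_long = data.explode(col, ignore_index=True)
data_long[col] = data_long[col].str.strip()  # drop the space after ';'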

How can I get the sum of one column based on year, which is stored in another column?

I have this code.
cheese_sums = []
for year in milk_products.groupby(milk_products['Date']):
    total = milk_products[milk_products['Date'] == year]['Cheddar Cheese Production (Thousand Tonnes)'].sum()
    cheese_sums.append(total)
print(cheese_sums)
I am trying to sum all the Cheddar Cheese Production values, which are stored as floats in the milk_products data frame. The Date column is a datetime object that holds only the year, but has 12 values representing each month. As it's written now, this only prints a list of six 0.0's.
I got it. It should be:
cheese_sums = []
for year in milk_products['Date']:
    total = milk_products[milk_products['Date'] == year]['Cheddar Cheese Production (Thousand Tonnes)'].sum()
    if total not in cheese_sums:
        cheese_sums.append(total)
print(cheese_sums)
You seem to be overcomplicating this. Your original loop printed 0.0's because iterating over a groupby yields (key, group) tuples rather than bare years, so milk_products['Date'] == year matched nothing.
Try groupby(...).sum():
df = milk_products.groupby('Date').sum()
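To get per-year totals for just the cheese column rather than summing every column, a minimal sketch assuming the column names from the question:
cheese_sums = (
    milk_products
    .groupby('Date')['Cheddar Cheese Production (Thousand Tonnes)']
    .sum()
)
print(cheese_sums)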

Sorting in pandas by multiple columns without distorting the index

I have just started out with Pandas and I am trying to do a multilevel sort of my data by columns. I have four columns in my data: STNAME, CTYNAME, CENSUS2010POP, SUMLEV. I want to set the index of my data to the columns STNAME, CTYNAME and then sort the data by CENSUS2010POP. After I set the index, the data appears as in pic 1 (before sorting by CENSUS2010POP), and when I sort, the data appears as in pic 2 (after sorting). You can see the indices are messy and no longer sorted serially.
I have read a few posts, including this one (Sorting a multi-index while respecting its index structure), which dates back five years and does not work as I write this. I have yet to learn the groupby function.
Could you please tell me a way I can achieve this?
ps: I come from an accounting/finance background and am very new to coding. I have just completed two Python courses, including PY4E.com.
I used the code below to set the index:
census_dfq6 = census_dfq6.set_index(['STNAME','CTYNAME'])
and used the code below to sort the data:
census_dfq6 = census_dfq6.sort_values(by=['CENSUS2010POP'], ascending=[False])
Here is a sample of the data I am working with. I would love to share the csv file, but I don't see a way to share it here.
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
Required End Result:
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
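A minimal sketch of one way to approach this, assuming the census_dfq6 frame above and that the goal is to order by population without scattering states: sort_values accepts index level names alongside column names, so sorting by the STNAME level first keeps each state's counties together while ordering them by CENSUS2010POP within the state.
census_dfq6 = census_dfq6.set_index(['STNAME', 'CTYNAME'])
# Sort by the STNAME index level, then by population within each state
census_dfq6 = census_dfq6.sort_values(by=['STNAME', 'CENSUS2010POP'],
                                      ascending=[True, False])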

Pandas Check Multiple Conditions [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
I have a small excel file that contains prices for our online store, and I am trying to automate this process. However, I don't fully trust the staff to properly qualify the data, so I wanted to use Pandas to quickly check over certain fields. I have managed to achieve everything I need so far, but I am only a beginner and I cannot think of the proper way to do the next part.
Basically I need to qualify 2 columns on the same row: we have one column MARGIN, and if this column is >60, then I need to check that the MARKDOWN column on the same row is populated == YES.
So my question is, how can I code a check that basically says that?
Below is an example of the way I have been doing my other checks, I realise it is quite beginner-ish, but I am only a beginner.
sku2 = df['SKU_2']
comp_at = df['COMPARE AT PRICE']
sales_price = df['SALES PRICE']
dni_act = df['DO NOT IMPORT - action']
dni_fur = df['DO NOT IMPORT - further details']
promo = df['PROMO']
replacement = df['REPLACEMENT']
go_live_date = df['go live date']
markdown = df['markdown']
# sales price not blank check
for item in sales_price:
    if pd.isna(item):
        with open('document.csv', 'a', newline="") as fd:
            writer = csv.writer(fd)
            writer.writerow(['It seems there is a blank sales price in here', str(file_name)])
        # the with block closes the file; the stray fd.close in the original was a no-op
        break
Example:
df = pd.DataFrame([
    ['a', 1, 2],
    ['b', 3, 4],
    ['a', 5, 6]],
    columns=['f1', 'f2', 'f3'])

# & represents AND; | represents OR
print(df[(df['f1'] == 'a') & (df['f2'] > 1)])
Output:
f1 f2 f3
2 a 5 6
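Applied to the question's own check, a short sketch (assuming the MARGIN and MARKDOWN column names and the literal YES value from the question): rows that break the rule are those where MARGIN is over 60 but MARKDOWN is not YES.
# Flag rows where MARGIN > 60 but MARKDOWN is not populated with YES
bad_rows = df[(df['MARGIN'] > 60) & (df['MARKDOWN'] != 'YES')]
if not bad_rows.empty:
    print('MARKDOWN missing for high-margin rows:')
    print(bad_rows)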

How to iterate over a data frame

I have a dataset of users, books and ratings, and I want to find users who rated a particular book highly, and for those users I want to find what other books they liked too.
My data looks like:
df.sample(5)
User-ID ISBN Book-Rating
49064 102967 0449244741 8
60600 251150 0452264464 9
376698 52853 0373710720 7
454056 224764 0590416413 7
54148 25409 0312421273 9
This is what I did so far:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.ix['0345339703'] # Lord of the Rings Part 1
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr['User-ID']
The last line failed with:
KeyError: 'User-ID'
I want to obtain the users who rated LOTR > 7 and, for those users, further find books they liked too from the matrix.
Help would be appreciated. Thanks.
In your like_lotr dataframe, 'User-ID' is the name of the index; you cannot select it like a normal column. That is why the line users = like_lotr['User-ID'] raises a KeyError: it is not a column.
Moreover, ix is deprecated; it is better to use loc in your case. And don't put quotes around the lookup key: it needs to be an integer, since the ISBN values were read in as integers (at least judging from your sample).
Try like this:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.loc[452264464] # used another number from your sample dataframe to test this code.
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr.index.tolist()
users is now a list with the ids you want.
Using your small sample above and the number I used to test, users is [251150].
An alternative solution is to use reset_index. The last two lines would then look like this:
like_lotr = lotr[lotr > 7].to_frame().reset_index()
users = like_lotr['User-ID']
reset_index puts the index back into the columns.
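From there, the "what other books they liked" part can be sketched as follows; this reuses df_p and the users list from above and keeps the same > 7 threshold (an illustration, not tested against the full dataset):
# Keep only the columns for the users who liked the book,
# then keep the books any of them rated above 7
other_books = df_p[users]
liked = other_books[(other_books > 7).any(axis=1)]
print(liked.index.tolist())  # ISBNs those users also rated highly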
