Pandas Check Multiple Conditions [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
I have a small Excel file that contains prices for our online store, and I am trying to automate this process. However, I don't fully trust the staff to properly qualify the data, so I wanted to use Pandas to quickly check over certain fields. I have managed to achieve everything I need so far, but I am only a beginner and cannot work out the proper approach for the next part.
So basically I need to qualify two columns on the same row: we have a MARGIN column, and if its value is > 60, I need to check that the MARKDOWN column on the same row is populated with YES.
So my question is: how can I code a check that says "if MARGIN > 60, then MARKDOWN must equal YES"?
Below is an example of the way I have been doing my other checks. I realise it is quite beginner-ish, but I am only a beginner.
import csv
import pandas as pd

sku2 = df['SKU_2']
comp_at = df['COMPARE AT PRICE']
sales_price = df['SALES PRICE']
dni_act = df['DO NOT IMPORT - action']
dni_fur = df['DO NOT IMPORT - further details']
promo = df['PROMO']
replacement = df['REPLACEMENT']
go_live_date = df['go live date']
markdown = df['markdown']

# sales price not blank check
for item in sales_price:
    if pd.isna(item):
        # the with block closes the file automatically
        with open('document.csv', 'a', newline='') as fd:
            writer = csv.writer(fd)
            writer.writerow(['It seems there is a blank sales price in here', str(file_name)])
        break
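As an aside, the row-by-row loop above can be replaced with a single vectorized check. A minimal sketch, assuming df and file_name are defined as in the script:
# isna().any() is True if any value in the column is missing
if df['SALES PRICE'].isna().any():
    with open('document.csv', 'a', newline='') as fd:
        csv.writer(fd).writerow(['It seems there is a blank sales price in here', str(file_name)])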

Example:
df = pd.DataFrame([
    ['a', 1, 2],
    ['b', 3, 4],
    ['a', 5, 6]],
    columns=['f1', 'f2', 'f3'])
# & combines conditions with AND; | would combine them with OR
print(df[(df['f1'] == 'a') & (df['f2'] > 1)])
Output:
f1 f2 f3
2 a 5 6
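Applied to the MARGIN/MARKDOWN check from the question, a minimal sketch (the exact column names and the YES value are assumptions based on the description):
# rows that violate the rule: MARGIN over 60 without MARKDOWN set to YES
bad_rows = df[(df['MARGIN'] > 60) & (df['MARKDOWN'] != 'YES')]
if not bad_rows.empty:
    print('Found rows with MARGIN > 60 where MARKDOWN is not YES')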


How do I create a new column and add values to it [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have two CSV files with data. They both have a common column (County). The first file just has the counties, while the second file has counties together with their populations. I have a script which I thought would be able to create a new column of population in the first file. Note that the order of the counties in the two files is totally different.
File one:
Id  County
1   Nairobi
2   Mombasa
3   Kisumu
4   Nakuru
File two:
Id  County   Population
1   Kisumu   1,250,200
2   Nairobi  4,560,700
3   Nakuru   2,673,800
4   Mombasa  3,167,900
I wanted to create a new Population column in the first table, parse through the second table, and pick up the population of each county, like in the table below.
Id  County   Population
1   Nairobi  4,560,700
2   Mombasa  3,167,900
3   Kisumu   1,250,200
4   Nakuru   2,673,800
Below is my code; I got a bit confused about how to execute that. Please help.
data = pd.read_csv('counties.csv')
county_names = data['COUNTY']
ref_data = pd.read_csv('kenya-population-by-sex-and-county.csv', skiprows=8, header=None)
ref_data.columns = ['County', 'Male', 'Female', 'Intersex', 'Total']
list_count = []
for item in county_names.tolist():
    compare = ref_data['County'].tolist()
    pop = ref_data['Total']
    if item in compare:
        list_count.append(item)
        pop  # note: this bare expression does nothing
    else:
        print(item + " is not in list")
You can simply create a pandas DataFrame containing County and Population and merge it with the first DataFrame, which has only County. There are also several join options (inner, left, outer, etc.) to fulfill different needs.
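A minimal sketch of that merge (the file names below are placeholders, and the simple Id/County/Population layout of File two is assumed):
import pandas as pd

counties = pd.read_csv('counties.csv')        # Id, County
populations = pd.read_csv('populations.csv')  # Id, County, Population

# a left join keeps every county from the first file, in its original order
result = counties.merge(populations[['County', 'Population']], on='County', how='left')
print(result)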

Python Pandas row selection

I've tried doing some searching, but I'm having trouble finding what I specifically need. I currently have this:
location = 'Location'
df = pd.read_csv('testbook.csv')  # read_csv already returns a DataFrame
search = 'OR'  # This will be replaced with an input
row = df[df.eq(search).any(axis=1)]
print(row)
Location = row.at[0, location]
print(Location)
This outputs the following.
row print out:
Location City Price Etc
0 FL OR 50 123
Location print out:
FL
This is the CSV file that the data is pulled from.
My main question concerns this specific line of code:
Location = row.at[0, location]
What I'm trying to work out, if it's possible, is the part in the brackets: [0, location].
I want to automate this in the future, since, for example, instead of 'OR' I may need to find what data is in 'OR1'. The issue is that the 0 refers to the row number, as you can see here (this is the entire df):
Location City Price Etc
0 FL OR 50 123
1 FL1 OR1 501 1231
2 FL2 OR2 502 1232
I would have to manually change the code every single time, which of course is unfeasible for what I'm trying to accomplish.
My main question is: how do I get the row number shown on the far left and turn it into a variable that I can use anywhere?
I'm having a bit of trouble figuring out what you are looking for, but this is my best guess:
import pandas as pd

df = pd.DataFrame({'Location': ['FL', 'FL1', 'FL2'],
                   'City': ['OR', 'OR1', 'OR2'],
                   'Price': [50, 501, 502],
                   'Etc': [123, 1231, 1232]})

# Given a search term -> find the location
search = 'OR'

# the boolean mask selects matching rows; .iloc[0] takes the first match
print(df.loc[df['City'] == search, 'Location'].iloc[0])  # Outputs 'FL'
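To directly answer the follow-up (getting the row number on the left as a reusable variable), a small sketch on the same frame:
# index label of the first row whose City matches the search term
row_number = df.index[df['City'] == search][0]
print(df.at[row_number, 'Location'])  # 'FL'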

I have an issue finding the negative results in pandas [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 2 years ago.
I have data called SalesData that contains "Profit", "Sales" and "Sub-Category", and when I use this code:
SubCategoryProfit = SalesData[["Sub-Category", "Profit"]].groupby(by="Sub-Category").sum().sort_values(by="Profit")
# Print the results (color_negative_red is a styling helper defined elsewhere)
SalesData.style.applymap(color_negative_red, subset=['Profit', 'Sales'])
print(SubCategoryProfit)
I get these results:
Profit
Sub-Category
Tables -17725.4811
Bookcases -3472.5560
Supplies -1189.0995
Fasteners 949.5182
Machines 3384.7569
Labels 5546.2540
Art 6527.7870
Envelopes 6964.1767
However, when I look for only the negative results with this code:
JustSubCatProf = SalesData[["Sub-Category", "Profit"]]
NegProfFilter = SalesData["Profit"] < 0.0
JustNegSubCatProf = JustSubCatProf[NegProfFilter].groupby(by = "Sub-Category").sum().sort_values(by = "Profit")
print(JustNegSubCatProf)
I will get this!
Profit
Sub-Category
Binders -38510.4964
Tables -32412.1483
Machines -30118.6682
Bookcases -12152.2060
Chairs -9880.8413
Appliances -8629.6412
Phones -7530.6235
Furnishings -6490.9134
Storage -6426.3038
Supplies -3015.6219
There should be only 3 negative results; I'm not sure what I am doing wrong.
Can someone help me, please?
You can create a new DataFrame with negative values by filtering as follows.
SalesData[SalesData["valuecol1"] < 0]
I cannot fully understand your problem as I cannot see exactly what data it has. If you can share some of the data, I can give a clearer answer.
Once you have this result:
In [2878]: df
Out[2878]:
Profit
Sub-Category
Tables -17725.4811
Bookcases -3472.5560
Supplies -1189.0995
Fasteners 949.5182
Machines 3384.7569
Labels 5546.2540
Art 6527.7870
Envelopes 6964.1767
You can do this to get only the negative rows:
In [2880]: df[df.Profit.lt(0)]
Out[2880]:
Profit
Sub-Category
Tables -17725.4811
Bookcases -3472.5560
Supplies -1189.0995
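For completeness, the reason the original attempt returns ten rows: the filter SalesData["Profit"] < 0 runs on individual transactions before grouping, so every sub-category containing at least one losing transaction appears, and only its losses are summed. Filtering after aggregation, as above, keeps only the sub-categories whose net profit is negative. A minimal sketch of that order of operations, assuming the same SalesData frame:
# aggregate first, then keep sub-categories whose net profit is negative
totals = SalesData.groupby('Sub-Category')['Profit'].sum().sort_values()
print(totals[totals < 0])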

Import CSV file where last column has many separators [duplicate]

This question already has an answer here:
python pandas read_csv delimiter in column data
(1 answer)
Closed 2 years ago.
The dataset looks like this:
region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
As you can see, the last column sometimes has separators in it. This makes it hard to import the CSV file in pandas using read_csv(). The only way I can import the file is by adding the parameter error_bad_lines=False to the function. But this way I'm losing some of the data.
How can I import the CSV file without losing data?
I would read the file as one single column and parse manually:
# read each line as a single column (this assumes the data contains no tabs)
df = pd.read_csv(filename, sep='\t')
# one named capture group per fixed field; everything after the fourth comma is status
pat = ','.join([f'(?P<{x}>[^,]*)' for x in ['region', 'state', 'latitude', 'longitude']])
pat = '^' + pat + ',(?P<status>.*)$'
df = df.iloc[:, 0].str.extract(pat)
Output:
region state latitude longitude status
0 florida FL 27.8333 -81.717 open,for,activity
1 georgia GA 32.9866 -83.6487 open
2 hawaii HI 21.1098 -157.5311 illegal,stuff
3 iowa IA 42.0046 -93.214 medical,limited
Have you tried the old-school technique with the split function? A major downside is that you'd end up losing data or bumping into errors if your data has a , in any of the first 4 fields/columns, but if not, you could use it.
data = open(file, 'r').read().split('\n')
for line in data:
    items = line.split(',', 4)  # at most 4 splits: the 5th field keeps its embedded commas
Each row's items list would look, for example, like this:
['hawaii', 'HI', '21.1098', '-157.5311', 'illegal,stuff']
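On pandas 1.4 or newer, another option (sketched here under that version assumption) is the callable form of on_bad_lines, which receives each over-long line as a list of fields and lets you fold the extras back into the last column:
import pandas as pd

# fold every field past the fourth back into a single status value
fix = lambda fields: fields[:4] + [','.join(fields[4:])]

df = pd.read_csv('data.csv', on_bad_lines=fix, engine='python')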

Filling in columns with info from other file based on condition

So there are 2 CSV files I'm working with:
File 1:
City KWR1 KWR2 KWR3
Killeen
Killeen
Houston
Whatever
File 2:
location link reviews
Killeen www.example.com 300
Killeen www.differentexample.com 200
Killeen www.example3.com 100
Killeen www.extraexample.com 20
Here's what I'm trying to make this code do:
Look at the 'City' in file one, take the top 3 links in file 2 (you can go ahead and assume the cities won't get mixed up), and then put these top 3 into the KWR1, KWR2 and KWR3 columns for all rows with the same 'City' value.
So it gets the top 3 and then just copies them to the right of all the same 'City' values.
Even asking this question correctly is difficult for me; I hope I've provided enough information.
I know how to read the file in with pandas and all that, I just can't code this exact situation...
It is a slightly unusual requirement, but I think you need three steps:
1. Keep only the three values you actually need per city:
df = df.sort_values(by='reviews', ascending=False).groupby('location').head(3).reset_index()
Hopefully this keeps only the first three from every city.
2. Then you somehow need to label your data. There might be better ways to do this, but here is one way: assign a new column with numbers and create a user-defined function.
import numpy as np
df['nums'] = np.arange(len(df))
Now you have a column full of numbers (kind of like line numbers)
You then create the function that will label your data:
def my_func(index):
    # assumes each city contributes exactly three consecutive rows,
    # so the global row number modulo 3 gives the rank within the city
    if index % 3 == 0:
        x = 'KWR' + str(1)
    elif index % 3 == 1:
        x = 'KWR' + str(2)
    elif index % 3 == 2:
        x = 'KWR' + str(3)
    return x
You can then create the labels you need:
df['labels'] = df.nums.apply(my_func)
3. Then you can pivot:
my_df = pd.pivot_table(df, values='reviews', index=['location'], columns='labels', aggfunc='max').reset_index()
This literally pulls out the labels (pivots) and puts the values into the right places.
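To land these values back in file one's KWR columns, a rough sketch of the final join (df1 is a hypothetical name for file one's frame; since the question asks for links rather than review counts, values='link' with aggfunc='first' may be the better pivot):
# attach the pivoted KWR columns to every row of file one with a matching city
result = df1[['City']].merge(my_df, left_on='City', right_on='location', how='left')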
