I need to subtract the smaller from the larger positive number in Column B among rows where Column A = 1 and Column C = 0. (I wrapped them in ** to make them easier to spot.)
I have something like this in my csv file but I have a lot more rows.
| Column A | Column B | Column C | Column D |
| -------- | -------- | -------- | -------- |
| 1 | -99 | 0 | 0.4567 |
| 1 | -99 | 0 | 0.5678 |
| 1 | 60 | 40 | 0.123 |
| 1 | 67 | 60 | 0.2894 |
| 1 | **69** | 0 | 0.3983 |
| 1 | 70 | 0 | 0.3983 |
| 1 | **71** | 0 | 0.3983 |
| 2 | -30 | 0 | 0.3983 |
| 2 | -40 | 20 | 0.3983 |
| 2 | 45 | 30 | 0.3983 |
| 2 | 46 | 40 | 0.3983 |
I tried to create a new column like this, but instead of the mean I need the difference between the max and the min.
for u in range(1, 19):
    ColumnZ = df.query(f'ColumnB > 0 & ColumnC == 0 & ColumnA == {u}')['ColumnB'].mean()
    test.loc[rowIndex, 'ColumnZ'] = ColumnZ
cat subtract.csv
Column A,Column B,Column C,Column D
1,-99,0,0.4567
1,67,60,0.2894
1,69,0,0.3983
1,71,0,0.3983
import csv

with open('subtract.csv', 'r', newline='') as csv_file:
    dReader = csv.DictReader(csv_file)
    number_list = []
    for row in dReader:
        # keep non-negative Column B values where Column A == 1 and Column C == 0
        if int(row['Column A']) == 1 and int(row['Column C']) == 0 and int(row['Column B']) >= 0:
            number_list.append(int(row['Column B']))

new_val = max(number_list) - min(number_list)
new_val
2
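For completeness, the same filter-and-subtract can be done directly in pandas; a minimal sketch, assuming the CSV column names above (the backticks in `query` are needed because the names contain spaces):

```python
import pandas as pd

# Small frame matching the CSV above (normally: df = pd.read_csv('subtract.csv'))
df = pd.DataFrame({
    'Column A': [1, 1, 1, 1],
    'Column B': [-99, 67, 69, 71],
    'Column C': [0, 60, 0, 0],
    'Column D': [0.4567, 0.2894, 0.3983, 0.3983],
})

# Keep only positive Column B values where Column A == 1 and Column C == 0
sel = df.query('`Column A` == 1 and `Column C` == 0 and `Column B` > 0')['Column B']
new_val = sel.max() - sel.min()
print(new_val)  # 71 - 69 -> 2
```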
I created these lists to store values I generate in a for loop, which is shown a bit further down.
neutralScore = []
lightPosScore = []
middlePosScore = []
heavyPosScore = []
lightNegScore = []
middleNegScore = []
heavyNegScore = []
Here comes the loop
score = float
convertStringToInt = get_sent_score_comp_df.apply(lambda x: float(x))
for score in convertStringToInt:
    if score <= 0.09 and score >= -0.09:
        neutralScore.append(score)
    elif score >= 0.091 and score <= 0.49:
        lightPosScore.append(score)
I want to save my sentiments into these lists, then join them so I can convert them into a DataFrame and store them in a MySQL DB.
Is there an elegant way to do this?
scores = pandas.DataFrame(data=neutralScore).append(lightPosScore, middlePosScore, heavyPosScore).append(
lightNegScore,
middleNegScore,
heavyNegScore).columns = ['heavyPosScore', 'middlePosScore', 'lightPosScore', 'neutralScore', 'lightNegScore', 'middleNegScore', 'heavyNegScore']
I know that declaring the list of columns needs to be done separately, but for now the code looks like that.
So far I tried this and it doesn't work since it returns:
Can only merge Series or DataFrame objects, a <class 'list'> was passed
which is understandable, but I can't think of a way to solve the problem right now.
It is not entirely clear from the question how you want your final dataframe to look. But I would do something like:
>>> import numpy as np
>>> import pandas as pd
>>> data = np.random.normal(0, 1, size=10)
>>> df = pd.DataFrame(data, columns=['score'])
>>> df
| | score |
|---:|----------:|
| 0 | -0.440304 |
| 1 | -0.597293 |
| 2 | -1.80229 |
| 3 | -1.65654 |
| 4 | -1.14571 |
| 5 | -0.760086 |
| 6 | 0.244437 |
| 7 | 0.828856 |
| 8 | -0.136325 |
| 9 | -0.325836 |
>>> df['neutralScore'] = (df.score <= 0.09) & (df.score >= -0.09)
>>> df['lightPosScore'] = (df.score >= 0.091) & (df.score <= 0.49)
And similar for the columns 'heavyPosScore', 'middlePosScore', 'lightNegScore', 'middleNegScore', 'heavyNegScore'.
>>> df
| | score | neutralScore | lightPosScore |
|---:|-----------:|:---------------|:----------------|
| 0 | -0.475571 | False | False |
| 1 | 0.109076 | False | True |
| 2 | 0.809947 | False | False |
| 3 | 0.595088 | False | False |
| 4 | 0.832727 | False | False |
| 5 | -1.30049 | False | False |
| 6 | 0.245578 | False | True |
| 7 | 0.0998278 | False | True |
| 8 | 0.20592 | False | True |
| 9 | 0.372493 | False | True |
You can then easily filter on the type of score like this:
>>> df[df.lightPosScore]
| | score | neutralScore | lightPosScore |
|---:|---------:|:---------------|:----------------|
| 0 | 0.415629 | False | True |
| 2 | 0.104852 | False | True |
| 4 | 0.39739 | False | True |
Edit
To have one rating column, first define a function to assign your ratings and apply it to your score column:
>>> def get_rating(score):
...     if score <= 0.09 and score >= -0.09:
...         return 'neutralScore'
...     elif score >= 0.091 and score <= 0.49:
...         return 'lightPosScore'
...     else:
...         return 'to be implemented'
>>> df['rating'] = df['score'].apply(get_rating)
>>> df
| | score | rating |
|---:|-----------:|:------------------|
| 0 | -0.190816 | to be implemented |
| 1 | 0.495197 | to be implemented |
| 2 | -1.20576 | to be implemented |
| 3 | -0.711516 | to be implemented |
| 4 | -0.0606396 | neutralScore |
| 5 | 0.0452575 | neutralScore |
| 6 | 0.154754 | lightPosScore |
| 7 | -0.506285 | to be implemented |
| 8 | -0.896066 | to be implemented |
| 9 | 0.523198 | to be implemented |
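If all seven bands are known up front, the chain of if/elif branches can also be collapsed with `pd.cut`; a sketch, where only the (-0.09, 0.09] and (0.09, 0.49] edges come from the question and the remaining bin edges are placeholders you would fill in:

```python
import pandas as pd

scores = pd.Series([-0.95, -0.3, 0.0, 0.2, 0.7])

# Only the -0.09/0.09/0.49 edges come from the question;
# the outer edges are assumed for illustration.
bins = [-1.0, -0.9, -0.49, -0.09, 0.09, 0.49, 0.9, 1.0]
labels = ['heavyNegScore', 'middleNegScore', 'lightNegScore', 'neutralScore',
          'lightPosScore', 'middlePosScore', 'heavyPosScore']

# Each score falls into exactly one half-open interval (right edge included)
rating = pd.cut(scores, bins=bins, labels=labels)
print(rating.tolist())
```

This keeps the band definitions in one place, so adding or moving an edge means editing a list instead of another elif branch.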
I have a column of numbers in a pandas DataFrame: 1,8,4,3,1,5,1,4,2
If I create a cumulative sum column it returns the cumulative sum. How do I return only the rows that reach a cumulative sum of 20, skipping numbers that would take the cumulative sum over 20?
+-----+-------+------+
| Var | total | cumu |
+-----+-------+------+
| a | 1 | 1 |
| b | 8 | 9 |
| c | 4 | 13 |
| d | 3 | 16 |
| e | 1 | 17 |
| f | 5 | 22 |
| g | 1 | 23 |
| h | 4 | 27 |
| i | 2 | 29 |
+-----+-------+------+
Desired output:
+-----+-------+------+
| Var | total | cumu |
+-----+-------+------+
| a | 1 | 1 |
| b | 8 | 9 |
| c | 4 | 13 |
| d | 3 | 16 |
| e | 1 | 17 |
| g | 1 | 18 |
| i | 2 | 20 |
+-----+-------+------+
If I understood your question correctly, you only want to skip values that would take the cumulative sum over 20:
def acc(total):
    s, rv = 0, []
    for v, t in zip(total.index, total):
        if s + t <= 20:
            s += t
            rv.append(v)
    return rv

df = df[df.index.isin(acc(df.total))]
df['cumu'] = df.total.cumsum()
print(df)
Prints:
Var total cumu
0 a 1 1
1 b 8 9
2 c 4 13
3 d 3 16
4 e 1 17
6 g 1 18
8 i 2 20
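As a middle ground between the explicit loop and a fully vectorized form, the same greedy skip can be expressed with `itertools.accumulate`; a sketch, assuming the first value is itself at most 20:

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'Var': list('abcdefghi'),
                   'total': [1, 8, 4, 3, 1, 5, 1, 4, 2]})

# Running total that only grows while it stays <= 20
# (accumulate seeds with the first element, hence the assumption above)
run = list(accumulate(df['total'], lambda s, t: s + t if s + t <= 20 else s))

# A row is kept exactly when it changed the running total
prev = [0] + run[:-1]
keep = [r != p for r, p in zip(run, prev)]

out = df[keep].assign(cumu=df['total'][keep].cumsum())
print(out)
```

The lambda carries the running total through the sequence, so the per-row state lives inside `accumulate` instead of in loop-local variables.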
I have the following pandas dataframe:
df['price_if_0005'] = df['price'] % Decimal('0.0005')
print(tabulate(df, headers='keys', tablefmt='psql'))
+-----+---------+-------------+-----------------+-----------------+
| | price | tpo_count | tpo | price_if_0005 |
|-----+---------+-------------+-----------------+-----------------|
| 0 | 1.4334 | 1 | n | 0.0004 |
| 1 | 1.4335 | 1 | n | 0 |
| 2 | 1.4336 | 1 | n | 0.0001 |
| 3 | 1.4337 | 1 | n | 0.0002 |
| 4 | 1.4338 | 1 | n | 0.0003 |
| 5 | 1.4339 | 1 | n | 0.0004 |
| 6 | 1.434 | 1 | n | 0 |
| 7 | 1.4341 | 1 | n | 0.0001 |
| 8 | 1.4342 | 3 | noq | 0.0002 |
| 9 | 1.4343 | 3 | noq | 0.0003 |
| 10 | 1.4344 | 3 | noq | 0.0004 |
I want another column that holds an empty string, or the value from the 'price' column when 'price_if_0005' is 0.
I.e., this would be the desired resulting table:
+-----+---------+-------------+-----------------+-----------------+--------+
| | price | tpo_count | tpo | price_if_0005 | label |
|-----+---------+-------------+-----------------+-----------------|--------+
| 0 | 1.4334 | 1 | n | 0.0004 | |
| 1 | 1.4335 | 1 | n | 0 | 1.4335 |
| 2 | 1.4336 | 1 | n | 0.0001 | |
| 3 | 1.4337 | 1 | n | 0.0002 | |
| 4 | 1.4338 | 1 | n | 0.0003 | |
| 5 | 1.4339 | 1 | n | 0.0004 | |
| 6 | 1.4340 | 1 | n | 0 | 1.4340 |
| 7 | 1.4341 | 1 | n | 0.0001 | |
| 8 | 1.4342 | 3 | noq | 0.0002 | |
| 9 | 1.4343 | 3 | noq | 0.0003 | |
| 10 | 1.4344 | 3 | noq | 0.0004 | |
I have tried:
df['label'] = ['' if x == 0 else str(y) for x,y in df['price_if_0005'], df['price']]
But I get:
File "<ipython-input-67-90c17f2505bf>", line 3
df['label'] = ['' if x == 0 else str(y) for x,y in df['price_if_0005'], df['price']]
^
SyntaxError: invalid syntax
Just use .loc with a pandas condition to assign only the rows you need:
df.loc[df['price_if_0005'] == 0, 'label'] = df['price']
full example:
import pandas as pd
from io import StringIO
s = """
price | tpo_count | tpo | price_if_0005
0 | 1.4334 | 1 | n | 0.0004
1 | 1.4335 | 1 | n | 0
2 | 1.4336 | 1 | n | 0.0001
3 | 1.4337 | 1 | n | 0.0002
4 | 1.4338 | 1 | n | 0.0003
5 | 1.4339 | 1 | n | 0.0004
6 | 1.434 | 1 | n | 0
7 | 1.4341 | 1 | n | 0.0001
8 | 1.4342 | 3 | noq | 0.0002
9 | 1.4343 | 3 | noq | 0.0003
10 | 1.4344 | 3 | noq | 0.0004 """
df = pd.read_csv(StringIO(s), sep=r"\s+\|\s+", engine="python")
df.loc[df['price_if_0005'] == 0, 'label'] = df['price']
df['label'] = df['label'].fillna('')
print(df)
Output:
price tpo_count tpo price_if_0005 label
0 1.4334 1 n 0.0004
1 1.4335 1 n 0.0000 1.4335
2 1.4336 1 n 0.0001
3 1.4337 1 n 0.0002
4 1.4338 1 n 0.0003
5 1.4339 1 n 0.0004
6 1.4340 1 n 0.0000 1.434
7 1.4341 1 n 0.0001
8 1.4342 3 noq 0.0002
9 1.4343 3 noq 0.0003
10 1.4344 3 noq 0.0004
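The .loc assignment plus fillna can also be collapsed into a single `np.where`; a sketch that formats the price with four decimals so 1.4340 keeps its trailing zero (which the float round-trip above drops):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [1.4334, 1.4335, 1.4336, 1.4340],
    'price_if_0005': [0.0004, 0.0, 0.0001, 0.0],
})

# Empty string where the remainder is non-zero, a fixed 4-decimal label otherwise
df['label'] = np.where(df['price_if_0005'] == 0,
                       df['price'].map('{:.4f}'.format), '')
print(df)
```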
This is different from the usual 'subtract until 0' questions here, as it is conditional on another column. This question is about creating that conditional column.
This dataframe consists of three columns.
Column 'quantity' tells you how much to add/subtract.
Column 'in' tells you when to subtract.
Column 'cumulative_in' tells you how much you have.
+----------+----+---------------+
| quantity | in | cumulative_in |
+----------+----+---------------+
| 5 | 0 | |
| 1 | 0 | |
| 3 | 1 | 3 |
| 4 | 1 | 7 |
| 2 | 1 | 9 |
| 1 | 0 | |
| 1 | 0 | |
| 3 | 0 | |
| 1 | -1 | |
| 2 | 0 | |
| 1 | 0 | |
| 2 | 0 | |
| 3 | 0 | |
| 3 | 0 | |
| 1 | 0 | |
| 3 | 0 | |
+----------+----+---------------+
Whenever column 'in' equals -1, I want to create, starting from the next row, a column 'out' (0/1) that says to keep subtracting until 'cumulative_in' reaches 0. Doing it by hand,
Column 'out' tells you when to keep subtracting.
Column 'cumulative_subtracted' tells you how much you have already subtracted.
I subtract column 'cumulative_in' by 'cumulative_subtracted' until it reaches 0, the output looks something like this:
+----------+----+---------------+-----+-----------------------+
| quantity | in | cumulative_in | out | cumulative_subtracted |
+----------+----+---------------+-----+-----------------------+
| 5 | 0 | | | |
| 1 | 0 | | | |
| 3 | 1 | 3 | | |
| 4 | 1 | 7 | | |
| 2 | 1 | 9 | | |
| 1 | 0 | | | |
| 1 | 0 | | | |
| 3 | 0 | | | |
| 1 | -1 | | | |
| 2 | 0 | 7 | 1 | 2 |
| 1 | 0 | 6 | 1 | 3 |
| 2 | 0 | 4 | 1 | 5 |
| 3 | 0 | 1 | 1 | 8 |
| 3 | 0 | 0 | 1 | 9 |
| 1 | 0 | | | |
| 3 | 0 | | | |
+----------+----+---------------+-----+-----------------------+
I couldn't find a vectorized solution to this; I would love to see one. However, the problem is not that hard when going through it row by row. I hope your dataframe is not too big!
First set up the data.
data = {
    "quantity": [5, 1, 3, 4, 2, 1, 1, 3, 1, 2, 1, 2, 3, 3, 1, 3],
    "in": [0, 0, 1, 1, 1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0],
    "cumulative_in": [
        np.NaN, np.NaN, 3, 7, 9, np.NaN, np.NaN, np.NaN,
        np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN,
    ],
}
Then set up the dataframe and the extra columns. I used np.NaN for 'out', but 0 was easier for 'cumulative_subtracted'.
df=pd.DataFrame(data)
df['out'] = np.NaN
df['cumulative_subtracted'] = 0
Set the initial variables
last_in = 0.
reduce = False
Go through the dataframe row by row, unfortunately.
for i in df.index:
    # check if necessary to adjust the last_in value.
    if ~np.isnan(df.at[i, "cumulative_in"]) and reduce == False:
        last_in = df.at[i, "cumulative_in"]
    # check if -1 and change reduce to True
    elif df.at[i, "in"] == -1:
        reduce = True
    # if reduce is True, implement the reductions
    elif reduce == True:
        df.at[i, "out"] = 1
        if df.at[i, "quantity"] <= last_in:
            last_in -= df.at[i, "quantity"]
            df.at[i, "cumulative_in"] = last_in
            df.at[i, "cumulative_subtracted"] = (
                df.at[i - 1, "cumulative_subtracted"] + df.at[i, "quantity"]
            )
        elif df.at[i, "quantity"] > last_in:
            df.at[i, "cumulative_in"] = 0
            df.at[i, "cumulative_subtracted"] = (
                df.at[i - 1, "cumulative_subtracted"] + last_in
            )
            last_in = 0
            reduce = False
This works for the data given, and hopefully for your whole dataset.
print(df)
quantity in cumulative_in out cumulative_subtracted
0 5 0 NaN NaN 0
1 1 0 NaN NaN 0
2 3 1 3.0 NaN 0
3 4 1 7.0 NaN 0
4 2 1 9.0 NaN 0
5 1 0 NaN NaN 0
6 1 0 NaN NaN 0
7 3 0 NaN NaN 0
8 1 -1 NaN NaN 0
9 2 0 7.0 1.0 2
10 1 0 6.0 1.0 3
11 2 0 4.0 1.0 5
12 3 0 1.0 1.0 8
13 3 0 0.0 1.0 9
14 1 0 NaN NaN 0
15 3 0 NaN NaN 0
It is not clear to me what happens when the quantity to subtract has not yet reached zero and another '1' appears in the 'in' column.
Still, here is a rough solution for a simple case:
import pandas as pd
import numpy as np
size = 20
df = pd.DataFrame(
    {
        "quantity": np.random.randint(1, 6, size),
        "in": np.full(size, np.nan),
    }
)

# These are just to place a random 1 and -1 into 'in', not important
df.loc[np.random.choice(df.iloc[:size//3, :].index, 1), 'in'] = 1
df.loc[np.random.choice(df.iloc[size//3:size//2, :].index, 1), 'in'] = -1
df.loc[np.random.choice(df.iloc[size//2:, :].index, 1), 'in'] = 1
# Forward-fill the missing values after each 1/-1 entry up to the next one.
df.loc[:, 'in'] = df['in'].ffill()
# Calculates the cumulative sum with a negative value for subtractions
df["cum_in"] = (df["quantity"] * df['in']).cumsum()
# Subtraction indicator and cumulative column
df['out'] = (df['in'] == -1).astype(int)
df["cumulative_subtracted"] = df.loc[df['in'] == -1, 'quantity'].cumsum()
# Remove values when the 'cum_in' turns to negative
df.loc[
df["cum_in"] < 0 , ["in", "cum_in", "out", "cumulative_subtracted"]
] = np.NaN
print(df)