How to create a NOT greater than condition using .loc in Python - python

How do I exclude rows from the data by condition?
1- Using .loc I selected the part to be removed.
2- The problem is that there are empty rows in "year"; I want to keep all the empty rows and anything < 2020.
I would use !> but that doesn't work, Python only accepts !=.
# dataframe
cw =
 year  name
 2022  as
 2020  ad
       sd
       sd
 1988  wwe
 1999  we
cw = cw.loc[cw['year']!>'2020']
The problem is the empty fields, it's tricky... I need to keep everything that is NOT > 2020, so I want to keep the empty rows and the smaller values.

Isn't not greater than n the same as less than or equal to n?
cw = cw.loc[cw['year']!>'2020']
simply becomes
cw = cw.loc[cw['year'] <= '2020']
Negating the query will also work, but it's important that your "year" column be numeric (an int or a timestamp, not strings) if you want the > comparison to behave correctly.
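One caveat, since the empty rows need to stay: a comparison against a missing value is always False, so <= quietly drops the blanks, while negating > keeps them. A quick sketch, assuming the column has already been converted to numbers with NaN for the empty cells:
import numpy as np
import pandas as pd

s = pd.Series([2022, 2020, np.nan, 1988])

s.loc[s <= 2020]      # 2020.0 and 1988.0 -- the NaN row is dropped
s.loc[~(s > 2020)]    # 2020.0, NaN and 1988.0 -- the NaN row is kept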
Try something more like this:
import pandas as pd
cw = pd.DataFrame({"year": [2022, 2020, None, None, 1988, 1999],
                   "name": ["as", "ad", "sd", "sd", "wwe", "we"]})
# "year" becomes float64, with NaN for the missing values
"""
     year name
0  2022.0   as
1  2020.0   ad
2     NaN   sd
3     NaN   sd
4  1988.0  wwe
5  1999.0   we
"""
cw = cw.loc[~(cw["year"] > 2020)]
"""
     year name
1  2020.0   ad
2     NaN   sd
3     NaN   sd
4  1988.0  wwe
5  1999.0   we
"""

Related

pandas plot every Nth index but always include last index

I have a plot, and I want to display only specific values so that the plot looks good and not clumsy.
In the example below, I want to display values only every few years, but I don't want to miss displaying the last value.
df =
Year Total value
0 2011 11.393630
1 2012 11.379185
2 2013 10.722502
3 2014 10.304044
4 2015 9.563496
5 2016 9.048299
6 2017 9.290901
7 2018 9.470320
8 2019 9.533228
9 2020 9.593088
10 2021 9.610742
import matplotlib.pyplot as plt

# Plot
df.plot(x='Year')
# Select every third point; these values will be displayed on the chart
col_tuple = df[['Year', 'Total value']][::3]
for j, k in col_tuple.itertuples(index=False):
    plt.text(j, k*1.1, '%.2f' % k)
plt.show()
How do I pick and show the last value as well?
I want to make sure the last value is there irrespective of the range or slice
The simplest way is to define the range/slice in reverse, e.g. [::-3]:
col_tuple = df[['Year', 'Total value']][::-3]
# Year Total value
# 10 2021 9.610742
# 7 2018 9.470320
# 4 2015 9.563496
# 1 2012 11.379185
df.plot('Year')
for x, y in col_tuple.itertuples(index=False):
    plt.text(x, y*1.01, f'{y:.2f}')
If you want to ensure both the last and first index, use Index.union to combine the (forward) sliced index and last index:
idx = df.index[::3].union([df.index[-1]])
col_tuple = df[['Year', 'Total value']].iloc[idx]
# Year Total value
# 0 2011 11.393630
# 3 2014 10.304044
# 6 2017 9.290901
# 9 2020 9.593088
# 10 2021 9.610742
df.plot('Year')
for x, y in col_tuple.itertuples(index=False):
    plt.text(x, y*1.01, f'{y:.2f}')

weighted average aggregation on multiple columns of df

I'm trying to calculate a weighted average for multiple columns in a dataframe.
This is a sample of my data
Group  Year  Month  Weight(kg)  Nitrogen  Calcium
A      2020  01          10000        10       70
A      2020  01          15000         4       78
A      2021  05          12000         5       66
A      2021  05          10000         8       54
B      2021  08          14000        10       90
C      2021  08          50000        20       92
C      2021  08          40000        10       95
My desired result would be one row per Group, Year and Month, holding the weighted average (by Weight(kg)) of each of the other columns.
What I've tried:
I can get the correct weighted average values for a single column using this function:
(similar to: link)
def wavg(df, value, weight):
    d = df[value]
    w = df[weight]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()
I can apply this function to a single column of my df:
df2 = df.groupby(["Group", "Year", "Month"]).apply(wavg, "Calcium", "Weight(kg)").to_frame()
(Don't mind the different values, they are correct for the data in my notebook)
The obvious problem is that this function only works for a single column whilst I have dozens of columns. I therefore tried a for loop:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "Year", "Month"]).apply(wavg, column, "Weight(kg)").to_frame())
It calculates the values correctly, but the columns are placed on top of each other instead of next to each other, and they are also missing a useful column name.
How could I adapt my code to return the desired df?
Change the function so it works on multiple columns, and to avoid losing the grouping columns convert them to a MultiIndex:
def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        return (d.mul(w, axis=0)).div(w.sum())
    except ZeroDivisionError:
        return d.mean()
# columns used for groupby
groups = ["Group", "Year", "Month"]
# process all the other columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
# create the index and process all columns in cols
df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print (df1)
Group Year Month Calcium Nitrogen
0 A 2020 1 28.000000 4.000000
1 A 2020 1 46.800000 2.400000
2 A 2021 5 36.000000 2.727273
3 A 2021 5 24.545455 3.636364
4 B 2021 8 90.000000 10.000000
5 C 2021 8 51.111111 11.111111
6 C 2021 8 42.222222 4.444444
Try via concat() and reset_index():
df=pd.concat(column_list,axis=1).reset_index()
OR
you can make changes here:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "Year", "Month"]).apply(wavg, column, "Weight(kg)").reset_index())
# Finally:
df = pd.concat(column_list, axis=1)

Pandas better way to add up value counts from different data-frames

There is a huge CSV file that is being read by pd.read_table('file.csv', chunksize=50000 ). Currently with each loop iteration I read the value_counts relevant to the current chunk using the df.col.value_counts() method. I got it working through loops and tricks with numpy, but I'm wondering if there is a cleaner way to do this using pandas?
Code:
prev = None
# LOOP CHUNK DATA
for imdb_basics in pd.read_table(
        'data/imdb.title.basics.tsv',
        dtype={'tconst': str, 'originalTitle': str, 'startYear': str},
        usecols=['tconst', 'originalTitle', 'startYear'],
        chunksize=50000,
        sep='\t'):
    # REMOVE NULL DATA & CONVERT TO NUMBER
    imdb_basics.startYear = imdb_basics.startYear.replace("\\N", 0)
    imdb_basics.startYear = pd.to_numeric(imdb_basics.startYear)
    # --- loops and tricks --- !
    tmp = imdb_basics.startYear.value_counts(sort=False)
    current = {
        'year': list(tmp.keys()),
        'count': list(tmp.values)
    }
    if prev is None:
        prev = current
    else:
        for i in range(len(prev['year'])):
            for j in range(len(current['year'])):
                if prev['year'][i] == current['year'][j]:
                    prev['count'][i] += current['count'][j]
        for i in range(len(current['year'])):
            if not (current['year'][i] in prev['year']):
                prev['year'].append(current['year'][i])
                prev['count'].append(current['count'][i])
EDIT:
I'm working with a large data file, plus the remote machine I'm currently using has a very limited amount of memory, so removing chunking in pandas is not an option.
Like I said in my comments, you don't need to worry about the key management. Pandas can do all of that for you. Consider this trivial example with some mock data with a year column and some other column:
from io import StringIO
import numpy
import pandas
numpy.random.seed(0)
# years to chose from
years = numpy.arange(2000, 2017)
# relative probabilities of a year being selected (2000 should be absent)
weights = numpy.linspace(0.0, 0.7, num=len(years))
weights /= weights.sum()
# fake dataframe turned into a fake CSV
x = numpy.random.choice(years, size=200, p=weights)
text = pandas.DataFrame({
    'year': x,
    'value': True
}).to_csv()
Since this is a small file, we can read it all at once to get the "correct" answer
pandas.read_csv(StringIO(text))['year'].value_counts().sort_index()
2001 1
2002 6
2003 2
2004 6
2005 6
2006 11
2007 9
2008 12
2009 13
2010 9
2011 18
2012 16
2013 29
2014 20
2015 21
2016 21
Name: year, dtype: int64
OK, so now let's try a chunking approach, using pandas methods:
result = None
for chunk in pandas.read_csv(StringIO(text), chunksize=25):
    tmp = chunk['year'].value_counts()
    if result is None:  # first chunk
        result = tmp.copy()
    else:  # all other chunks
        result = result.add(tmp, fill_value=0).astype(int)
final = result.sort_index()
final
2001 1
2002 6
2003 2
2004 6
2005 6
2006 11
2007 9
2008 12
2009 13
2010 9
2011 18
2012 16
2013 29
2014 20
2015 21
2016 21
Name: year, dtype: int64
So it works. Pandas will align and fill the index during basic operations.
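The heavy lifting is done by that index alignment plus fill_value=0, which covers years that appear in only one of the two operands; a tiny illustration of what the .add call does:
import pandas as pd

a = pd.Series({2001: 3, 2002: 5})
b = pd.Series({2002: 1, 2003: 4})

# indexes are aligned; fill_value=0 stands in for the years missing on one side
a.add(b, fill_value=0)
# 2001    3.0
# 2002    6.0
# 2003    4.0
# dtype: float64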
You could try dask.dataframe. It is underused because it only offers a subset of pandas functionality. But if the problem is the ugly syntax via chunking, you could try this:
import dask.dataframe as dd
df = dd.read_csv('my_big_file.csv')
counts = df['col'].value_counts()
counts.compute()
Internally, dask deals with chunking, aggregation, etc.

Map dataframe column value by another column's value

My dataframe has a month column with values that repeat as Apr, Apr.1, Apr.2 etc. because there is no year column. I added a year column based on the month value using a for loop as shown below, but I'd like to find a more efficient way to do this:
Products['Year'] = '2015'
for i in range(0, len(Products.Month)):
    if '.1' in Products['Month'][i]:
        Products['Year'][i] = '2016'
    elif '.2' in Products['Month'][i]:
        Products['Year'][i] = '2017'
You can use .str to treat the whole column like a string and split it at the dot.
Then apply a function that takes the number string and turns it into a new year value if possible.
Starting dataframe:
Month
0 Apr
1 Apr.1
2 Apr.2
Solution:
def get_year(entry):
    value = 2015
    try:
        value += int(entry[-1])
    finally:
        return str(value)

df['Year'] = df.Month.str.split('.').apply(get_year)
Now df is:
Month Year
0 Apr 2015
1 Apr.1 2016
2 Apr.2 2017
You can use pd.to_numeric after splitting and add 2015, i.e.
df['new'] = pd.to_numeric(df['Month'].str.split('.').str[-1],errors='coerce').fillna(0) + 2015
# Sample DataFrame from @Mike Muller
Month Year new
0 Apr 2015 2015.0
1 Apr.1 2016 2016.0
2 Apr.2 2017 2017.0
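The new column comes out as float because of the NaN handling; if you prefer plain integers (or strings, to match the Year column above), you can chain a cast onto the same expression, for example:
import pandas as pd

# assumes the same df as above
df['new'] = (pd.to_numeric(df['Month'].str.split('.').str[-1], errors='coerce')
               .fillna(0)       # months without a suffix map to 0
               .add(2015)       # offset from the base year
               .astype(int))    # drop the trailing .0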

looping through columns and adjusting values pandas

I have a df with around 100,000 rows and 1,000 columns and need to make some adjustments based on the existing data. How do I best approach this? Most of the changes will follow this basic formula:
search a column (or two or three) to see if a condition is met
if met, change the values of dozens or hundreds of columns in that row
This is my best attempt, where I created a list of the columns and was looking to see whether the first column contained the value 1. Where it did, I wanted to just add some number. That part worked, but it only worked on the FIRST row, not on all the 1s in the column. To fix that, I think I need to create a loop where I have the second [i] that goes through all the rows, but I wasn't sure if I was approaching the entire problem incorrectly. FWIW, test_cols = list of columns and testing_2 is my df.
def try_this(test_cols):
    for i in range(len(test_cols)):
        if i == 0 and testing_2[test_cols[i]][i] == 1:
            testing_2[test_cols[i]][i] = testing_2[test_cols[i]][i] + 78787
        i += 1
    return test_cols
Edit/example:
Year Month Mean_Temp
City
Madrid 1999 Jan 7--this value should appear twice
Bilbao 1999 Jan 9--appear twice
Madrid 1999 Feb 9
Bilbao 1999 Feb 10
. . . .
. . . .
. . . .
Madrid 2000 Jan 6.8--this value should go away
Bilbao 2000 Jan 9.2--gone
So I would need to do something like (using your answer):
def alter(row):
    if row['Year'] == 2000 and row['Month'] == 'Jan':
        row['Mean_Temp'] = row['Mean_Temp']  # from year 1999!
        return row['Mean_Temp']
    else:
        return row['Mean_Temp']
One way you could do this is by creating a function and applying it. Suppose you want to set column 'c' to ten times the value in 'b' whenever the corresponding row has an even number in 'a' or 'b' (and to 'b' itself otherwise).
import pandas as pd
data = {'a':[1,2,3,4],'b':[3,6,8,12], 'c':[1,2,3,4]}
df = pd.DataFrame(data)
def alter(row):
    if row['a'] % 2 == 0 or row['b'] % 2 == 0:
        return row['b']*10
    else:
        return row['b']

df['c'] = df.apply(alter, axis=1)
would create a df that looks like,
a b c
0 1 3 3
1 2 6 60
2 3 8 80
3 4 12 120
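With 100,000 rows and 1,000 columns, a row-wise apply can get slow. The same result can also be computed with a vectorized boolean mask via numpy.where; a minimal sketch of that alternative on the same data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [3, 6, 8, 12], 'c': [1, 2, 3, 4]})

# True wherever 'a' or 'b' holds an even number
mask = (df['a'] % 2 == 0) | (df['b'] % 2 == 0)

# take b*10 where the mask is True, plain b elsewhere
df['c'] = np.where(mask, df['b'] * 10, df['b'])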
Edit to add:
If you want to apply values from other parts of the df you could put those in a dict and then pass that into your apply function.
import pandas as pd
data = {'Cities': ['Madrid', 'Bilbao'] * 3, 'Year': [1999] * 4 + [2000] * 2,
        'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Jan'],
        'Mean_Temp': [7, 9, 9, 10, 6.8, 9.2]}
df = pd.DataFrame(data)
df = df[['Cities', 'Year', 'Month', 'Mean_Temp']]
# create dictionary with the values from 1999
edf = df[df.Year == 1999]
keys = zip(edf.Cities, edf.Month)
values = edf.Mean_Temp
dictionary = dict(zip(keys, values))
def alter(row, dictionary):
    if row['Year'] == 2000 and row['Month'] == 'Jan':
        return dictionary[(row.Cities, row.Month)]
    else:
        return row['Mean_Temp']

df['Mean_Temp'] = df.apply(alter, args=(dictionary,), axis=1)
Which gives you a df that looks like,
Cities Year Month Mean_Temp
0 Madrid 1999 Jan 7
1 Bilbao 1999 Jan 9
2 Madrid 1999 Feb 9
3 Bilbao 1999 Feb 10
4 Madrid 2000 Jan 7
5 Bilbao 2000 Jan 9
Of course you can change the parameters however you like. Hope this helps.
