How to create a NOT greater than condition using .loc in Python - python

How do I exclude rows from the data by condition?
1- Using .loc I selected the part to be removed.
2- The problem is that there are empty rows in "year"; I want to keep all the empty rows and anything < 2020.
I would use !> but that doesn't work, Python only accepts !=.
# dataframe
cw =
 year  name
 2022  as
 2020  ad
       sd
       sd
 1988  wwe
 1999  we
cw = cw.loc[cw['year']!>'2020']
The problem is the empty fields, it's tricky... I need to keep everything that is NOT > 2020, so I want to keep the empty rows and the smaller values.

Isn't not greater than n the same as less than or equal to n?
cw = cw.loc[cw['year']!>'2020']
simply becomes
cw = cw.loc[cw['year'] <= '2020']
Negating the query will also work, but it's important that your "year" column be numeric (an int or a timestamp, not strings) if you want the > comparison to behave correctly.
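One caveat, since the empty rows need to stay: a comparison against a missing value is always False, so <= quietly drops the blanks, while negating > keeps them. A quick sketch, assuming the column has already been converted to numbers with NaN for the empty cells:
import numpy as np
import pandas as pd

s = pd.Series([2022, 2020, np.nan, 1988])

s.loc[s <= 2020]      # 2020.0 and 1988.0 -- the NaN row is dropped
s.loc[~(s > 2020)]    # 2020.0, NaN and 1988.0 -- the NaN row is kept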
Try something more like this:
import pandas as pd
cw = pd.DataFrame({"year": [2022, 2020, None, None, 1988, 1999],
                   "name": ["as", "ad", "sd", "sd", "wwe", "we"]})
# "year" becomes float64, with NaN for the missing values
"""
     year name
0  2022.0   as
1  2020.0   ad
2     NaN   sd
3     NaN   sd
4  1988.0  wwe
5  1999.0   we
"""
cw = cw.loc[~(cw["year"] > 2020)]
"""
     year name
1  2020.0   ad
2     NaN   sd
3     NaN   sd
4  1988.0  wwe
5  1999.0   we
"""

Related

pandas plot every Nth index but always include last index

I have a plot, and I want to display only specific values so that the plot looks good and not clumsy.
In the example below, I want to display values only every few years, but I don't want to miss displaying the last value.
df =
Year Total value
0 2011 11.393630
1 2012 11.379185
2 2013 10.722502
3 2014 10.304044
4 2015 9.563496
5 2016 9.048299
6 2017 9.290901
7 2018 9.470320
8 2019 9.533228
9 2020 9.593088
10 2021 9.610742
import matplotlib.pyplot as plt

# Plot
df.plot(x='Year')
# Select every third point; these values will be displayed on the chart
col_tuple = df[['Year', 'Total value']][::3]
for j, k in col_tuple.itertuples(index=False):
    plt.text(j, k*1.1, '%.2f' % k)
plt.show()
How do I pick and show the last value as well?
I want to make sure the last value is there irrespective of the range or slice
The simplest way is to define the range/slice in reverse, e.g. [::-3]:
col_tuple = df[['Year', 'Total value']][::-3]
# Year Total value
# 10 2021 9.610742
# 7 2018 9.470320
# 4 2015 9.563496
# 1 2012 11.379185
df.plot('Year')
for x, y in col_tuple.itertuples(index=False):
    plt.text(x, y*1.01, f'{y:.2f}')
If you want to ensure both the last and first index, use Index.union to combine the (forward) sliced index and last index:
idx = df.index[::3].union([df.index[-1]])
col_tuple = df[['Year', 'Total value']].iloc[idx]
# Year Total value
# 0 2011 11.393630
# 3 2014 10.304044
# 6 2017 9.290901
# 9 2020 9.593088
# 10 2021 9.610742
df.plot('Year')
for x, y in col_tuple.itertuples(index=False):
    plt.text(x, y*1.01, f'{y:.2f}')

weighted average aggregation on multiple columns of df

I'm trying to calculate a weighted average for multiple columns in a dataframe.
This is a sample of my data
Group  Year  Month  Weight(kg)  Nitrogen  Calcium
A      2020  01          10000        10       70
A      2020  01          15000         4       78
A      2021  05          12000         5       66
A      2021  05          10000         8       54
B      2021  08          14000        10       90
C      2021  08          50000        20       92
C      2021  08          40000        10       95
My desired result would be one row per Group, Year and Month, holding the weighted average (by Weight(kg)) of each of the other columns.
What I've tried:
I can get the correct weighted average values for a single column using this function:
(similar to: link)
def wavg(df, value, weight):
    d = df[value]
    w = df[weight]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()
I can apply this function to a single column of my df:
df2 = df.groupby(["Group", "Year", "Month"]).apply(wavg, "Calcium", "Weight(kg)").to_frame()
(Don't mind the different values, they are correct for the data in my notebook)
The obvious problem is that this function only works for a single column whilst I have dozens of columns. I therefore tried a for loop:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "Year", "Month"]).apply(wavg, column, "Weight(kg)").to_frame())
It calculates the values correctly, but the columns are placed on top of each other instead of next to each other, and they are also missing a useful column name.
How could I adapt my code to return the desired df?
Change the function so it works on multiple columns, and to avoid losing the grouping columns convert them to a MultiIndex:
def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        return (d.mul(w, axis=0)).div(w.sum())
    except ZeroDivisionError:
        return d.mean()
# columns used for groupby
groups = ["Group", "Year", "Month"]
# process all the other columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
# create the index and process all columns in cols
df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print (df1)
Group Year Month Calcium Nitrogen
0 A 2020 1 28.000000 4.000000
1 A 2020 1 46.800000 2.400000
2 A 2021 5 36.000000 2.727273
3 A 2021 5 24.545455 3.636364
4 B 2021 8 90.000000 10.000000
5 C 2021 8 51.111111 11.111111
6 C 2021 8 42.222222 4.444444
Try via concat() and reset_index():
df=pd.concat(column_list,axis=1).reset_index()
OR
you can make changes here:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "Year", "Month"]).apply(wavg, column, "Weight(kg)").reset_index())
# Finally:
df = pd.concat(column_list, axis=1)

Pandas better way to add up value counts from different data-frames

There is a huge CSV file that is being read by pd.read_table('file.csv', chunksize=50000 ). Currently with each loop iteration I read the value_counts relevant to the current chunk using the df.col.value_counts() method. I got it working through loops and tricks with numpy, but I'm wondering if there is a cleaner way to do this using pandas?
Code:
prev = None
# LOOP CHUNK DATA
for imdb_basics in pd.read_table(
        'data/imdb.title.basics.tsv',
        dtype={'tconst': str, 'originalTitle': str, 'startYear': str},
        usecols=['tconst', 'originalTitle', 'startYear'],
        chunksize=50000,
        sep='\t'):
    # REMOVE NULL DATA & CONVERT TO NUMBER
    imdb_basics.startYear = imdb_basics.startYear.replace("\\N", 0)
    imdb_basics.startYear = pd.to_numeric(imdb_basics.startYear)
    # --- loops and tricks --- !
    tmp = imdb_basics.startYear.value_counts(sort=False)
    current = {
        'year': list(tmp.keys()),
        'count': list(tmp.values)
    }
    if prev is None:
        prev = current
    else:
        for i in range(len(prev['year'])):
            for j in range(len(current['year'])):
                if prev['year'][i] == current['year'][j]:
                    prev['count'][i] += current['count'][j]
        for i in range(len(current['year'])):
            if not (current['year'][i] in prev['year']):
                prev['year'].append(current['year'][i])
                prev['count'].append(current['count'][i])
EDIT:
I'm working with a large data file, plus the remote machine I'm currently using has a very limited amount of memory, so removing chunking in pandas is not an option.
Like I said in my comments, you don't need to worry about the key management. Pandas can do all of that for you. Consider this trivial example with some mock data with a year column and some other column:
from io import StringIO
import numpy
import pandas
numpy.random.seed(0)
# years to chose from
years = numpy.arange(2000, 2017)
# relative probabilities of a year being selected (2000 should be absent)
weights = numpy.linspace(0.0, 0.7, num=len(years))
weights /= weights.sum()
# fake dataframe turned into a fake CSV
x = numpy.random.choice(years, size=200, p=weights)
text = pandas.DataFrame({
    'year': x,
    'value': True
}).to_csv()
Since this is a small file, we can read it all at once to get the "correct" answer
pandas.read_csv(StringIO(text))['year'].value_counts().sort_index()
2001 1
2002 6
2003 2
2004 6
2005 6
2006 11
2007 9
2008 12
2009 13
2010 9
2011 18
2012 16
2013 29
2014 20
2015 21
2016 21
Name: year, dtype: int64
OK, so now let's try a chunking approach, using pandas methods:
result = None
for chunk in pandas.read_csv(StringIO(text), chunksize=25):
    tmp = chunk['year'].value_counts()
    if result is None:  # first chunk
        result = tmp.copy()
    else:  # all other chunks
        result = result.add(tmp, fill_value=0).astype(int)
final = result.sort_index()
final
2001 1
2002 6
2003 2
2004 6
2005 6
2006 11
2007 9
2008 12
2009 13
2010 9
2011 18
2012 16
2013 29
2014 20
2015 21
2016 21
Name: year, dtype: int64
So it works. Pandas will align and fill the index during basic operations.
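The heavy lifting is done by that index alignment plus fill_value=0, which covers years that appear in only one of the two operands; a tiny illustration of what the .add call does:
import pandas as pd

a = pd.Series({2001: 3, 2002: 5})
b = pd.Series({2002: 1, 2003: 4})

# indexes are aligned; fill_value=0 stands in for the years missing on one side
a.add(b, fill_value=0)
# 2001    3.0
# 2002    6.0
# 2003    4.0
# dtype: float64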
You could try dask.dataframe. It is underused because it only offers a subset of pandas functionality. But if the problem is the ugly syntax via chunking, you could try this:
import dask.dataframe as dd
df = dd.read_csv('my_big_file.csv')
counts = df['col'].value_counts()
counts.compute()
Internally, dask deals with chunking, aggregation, etc.

Map dataframe column value by another column's value

My dataframe has a month column with values that repeat as Apr, Apr.1, Apr.2 etc. because there is no year column. I added a year column based on the month value using a for loop as shown below, but I'd like to find a more efficient way to do this:
Products['Year'] = '2015'
for i in range(0, len(Products.Month)):
    if '.1' in Products['Month'][i]:
        Products['Year'][i] = '2016'
    elif '.2' in Products['Month'][i]:
        Products['Year'][i] = '2017'
You can use .str to treat the whole column like a string and split it at the dot.
Then apply a function that takes the number string and turns it into a new year value if possible.
Starting dataframe:
Month
0 Apr
1 Apr.1
2 Apr.2
Solution:
def get_year(entry):
    value = 2015
    try:
        value += int(entry[-1])
    finally:
        return str(value)

df['Year'] = df.Month.str.split('.').apply(get_year)
Now df is:
Month Year
0 Apr 2015
1 Apr.1 2016
2 Apr.2 2017
You can use pd.to_numeric after splitting and add 2015, i.e.
df['new'] = pd.to_numeric(df['Month'].str.split('.').str[-1],errors='coerce').fillna(0) + 2015
# Sample DataFrame from @Mike Muller
Month Year new
0 Apr 2015 2015.0
1 Apr.1 2016 2016.0
2 Apr.2 2017 2017.0
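The new column comes out as float because of the NaN handling; if you prefer plain integers (or strings, to match the Year column above), you can chain a cast onto the same expression, for example:
import pandas as pd

# assumes the same df as above
df['new'] = (pd.to_numeric(df['Month'].str.split('.').str[-1], errors='coerce')
               .fillna(0)       # months without a suffix map to 0
               .add(2015)       # offset from the base year
               .astype(int))    # drop the trailing .0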

looping through columns and adjusting values pandas

I have a df with around 100,000 rows and 1,000 columns and need to make some adjustments based on the existing data. How do I best approach this? Most of the changes will follow this basic formula:
search a column (or two or three) to see if a condition is met
if met, change the values of dozens or hundreds of columns in that row
This is my best attempt, where I created a list of the columns and was looking to see whether the first column contained the value 1. Where it did, I wanted to just add some number. That part worked, but it only worked on the FIRST row, not on all the 1s in the column. To fix that, I think I need to create a loop where I have the second [i] that goes through all the rows, but I wasn't sure if I was approaching the entire problem incorrectly. FWIW, test_cols = list of columns and testing_2 is my df.
def try_this(test_cols):
    for i in range(len(test_cols)):
        if i == 0 and testing_2[test_cols[i]][i] == 1:
            testing_2[test_cols[i]][i] = testing_2[test_cols[i]][i] + 78787
        i += 1
    return test_cols
Edit/example:
Year Month Mean_Temp
City
Madrid 1999 Jan 7--this value should appear twice
Bilbao 1999 Jan 9--appear twice
Madrid 1999 Feb 9
Bilbao 1999 Feb 10
. . . .
. . . .
. . . .
Madrid 2000 Jan 6.8--this value should go away
Bilbao 2000 Jan 9.2--gone
So I would need to do something like (using your answer):
def alter(row):
    if row['Year'] == 2000 and row['Month'] == 'Jan':
        row['Mean_Temp'] = row['Mean_Temp']  # from year 1999!
        return row['Mean_Temp']
    else:
        return row['Mean_Temp']
One way you could do this is by creating a function and applying it. Suppose you want to set column 'c' to ten times the value in 'b' whenever the corresponding row has an even number in 'a' or 'b' (and to 'b' itself otherwise).
import pandas as pd
data = {'a':[1,2,3,4],'b':[3,6,8,12], 'c':[1,2,3,4]}
df = pd.DataFrame(data)
def alter(row):
    if row['a'] % 2 == 0 or row['b'] % 2 == 0:
        return row['b']*10
    else:
        return row['b']

df['c'] = df.apply(alter, axis=1)
would create a df that looks like,
a b c
0 1 3 3
1 2 6 60
2 3 8 80
3 4 12 120
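With 100,000 rows and 1,000 columns, a row-wise apply can get slow. The same result can also be computed with a vectorized boolean mask via numpy.where; a minimal sketch of that alternative on the same data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [3, 6, 8, 12], 'c': [1, 2, 3, 4]})

# True wherever 'a' or 'b' holds an even number
mask = (df['a'] % 2 == 0) | (df['b'] % 2 == 0)

# take b*10 where the mask is True, plain b elsewhere
df['c'] = np.where(mask, df['b'] * 10, df['b'])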
Edit to add:
If you want to apply values from other parts of the df you could put those in a dict and then pass that into your apply function.
import pandas as pd
data = {'Cities': ['Madrid', 'Bilbao'] * 3, 'Year': [1999] * 4 + [2000] * 2,
        'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Jan'],
        'Mean_Temp': [7, 9, 9, 10, 6.8, 9.2]}
df = pd.DataFrame(data)
df = df[['Cities', 'Year', 'Month', 'Mean_Temp']]
# create dictionary with the values from 1999
edf = df[df.Year == 1999]
keys = zip(edf.Cities, edf.Month)
values = edf.Mean_Temp
dictionary = dict(zip(keys, values))
def alter(row, dictionary):
    if row['Year'] == 2000 and row['Month'] == 'Jan':
        return dictionary[(row.Cities, row.Month)]
    else:
        return row['Mean_Temp']

df['Mean_Temp'] = df.apply(alter, args=(dictionary,), axis=1)
Which gives you a df that looks like,
Cities Year Month Mean_Temp
0 Madrid 1999 Jan 7
1 Bilbao 1999 Jan 9
2 Madrid 1999 Feb 9
3 Bilbao 1999 Feb 10
4 Madrid 2000 Jan 7
5 Bilbao 2000 Jan 9
Of course you can change the parameters however you like. Hope this helps.
