I have a df with around 100,000 rows and 1,000 columns and need to make some adjustments based on the existing data. How do I best approach this? Most of the changes will follow this basic formula:
search a column (or two or three) to see if a condition is met
if met, change the values of dozens or hundreds of columns in that row
This is my best attempt, where I created a list of the columns and was looking to see whether the first column contained the value 1. Where it did, I wanted to just add some number. That part worked, but it only worked on the FIRST row, not on all the 1s in the column. To fix that, I think I need to create a loop where I have the second [i] that goes through all the rows, but I wasn't sure if I was approaching the entire problem incorrectly. FWIW, test_cols = list of columns and testing_2 is my df.
def try_this(test_cols):
    for i in range(len(test_cols)):
        if i == 0 and testing_2[test_cols[i]][i] == 1:
            testing_2[test_cols[i]][i] = testing_2[test_cols[i]][i] + 78787
        i += 1
    return test_cols
Edit/example:
        Year  Month  Mean_Temp
City
Madrid  1999  Jan    7      -- this value should appear twice
Bilbao  1999  Jan    9      -- appear twice
Madrid  1999  Feb    9
Bilbao  1999  Feb    10
...
Madrid  2000  Jan    6.8    -- this value should go away
Bilbao  2000  Jan    9.2    -- gone
So I would need to do something like (using your answer):
def alter(row):
    if row['Year'] == 2000 and row['Month'] == 'Jan':
        row['Mean_Temp'] = row['Mean_Temp']  # from year 1999!
        return row['Mean_Temp']
    else:
        return row['Mean_Temp']
One way you could do this is by creating a function and applying it. Suppose you want to set column 'c' to ten times column 'b' whenever the corresponding row in 'a' or 'b' holds an even number.
import pandas as pd

data = {'a': [1, 2, 3, 4], 'b': [3, 6, 8, 12], 'c': [1, 2, 3, 4]}
df = pd.DataFrame(data)

def alter(row):
    if row['a'] % 2 == 0 or row['b'] % 2 == 0:
        return row['b'] * 10
    else:
        return row['b']

df['c'] = df.apply(alter, axis=1)
would create a df that looks like,
a b c
0 1 3 3
1 2 6 60
2 3 8 80
3 4 12 120
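As a side note, for a frame the size described in the question (~100,000 rows), a vectorized boolean mask is usually much faster than a row-wise apply. A sketch of the same logic on the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [3, 6, 8, 12], 'c': [1, 2, 3, 4]})

# boolean mask: True where 'a' or 'b' is even
mask = (df['a'] % 2 == 0) | (df['b'] % 2 == 0)

# take b*10 where the mask holds, otherwise b
df['c'] = np.where(mask, df['b'] * 10, df['b'])
```

This produces the same c column as the apply version, without calling a Python function per row.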
Edit to add:
If you want to apply values from other parts of the df you could put those in a dict and then pass that into your apply function.
import pandas as pd

data = {'Cities': ['Madrid', 'Balbao'] * 3,
        'Year': [1999] * 4 + [2000] * 2,
        'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Jan'],
        'Mean_Temp': [7, 9, 9, 10, 6.8, 9.2]}
df = pd.DataFrame(data)
df = df[['Cities', 'Year', 'Month', 'Mean_Temp']]

# create a dictionary with the values from 1999
edf = df[df.Year == 1999]
keys = zip(edf.Cities, edf.Month)
values = edf.Mean_Temp
dictionary = dict(zip(keys, values))

def alter(row, dictionary):
    if row['Year'] == 2000 and row['Month'] == 'Jan':
        return dictionary[(row.Cities, row.Month)]
    else:
        return row['Mean_Temp']

df['Mean_Temp'] = df.apply(alter, args=(dictionary,), axis=1)
Which gives you a df that looks like,
Cities Year Month Mean_Temp
0 Madrid 1999 Jan 7
1 Balbao 1999 Jan 9
2 Madrid 1999 Feb 9
3 Balbao 1999 Feb 10
4 Madrid 2000 Jan 7
5 Balbao 2000 Jan 9
Of course you can change the parameters however you like. Hope this helps.
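The same lookup can also be done without apply: build the 1999 values into a dict keyed by (Cities, Month), then map zipped keys onto just the 2000 rows. A sketch on the same sample data:

```python
import pandas as pd

data = {'Cities': ['Madrid', 'Balbao'] * 3,
        'Year': [1999] * 4 + [2000] * 2,
        'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Jan'],
        'Mean_Temp': [7, 9, 9, 10, 6.8, 9.2]}
df = pd.DataFrame(data)

# (Cities, Month) -> Mean_Temp for the 1999 rows
ref = df[df.Year == 1999].set_index(['Cities', 'Month'])['Mean_Temp'].to_dict()

# overwrite the 2000 January rows with the matching 1999 values
mask = (df.Year == 2000) & (df.Month == 'Jan')
keys = pd.Series(list(zip(df['Cities'], df['Month'])), index=df.index)
df.loc[mask, 'Mean_Temp'] = keys[mask].map(ref)
```

Only the masked rows are touched, so the 1999 rows keep their original values.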
I have a very complex situation: I need to append rows containing an aggregate sum of population, based on different columns. See below.
Note that the real data has many more rows in every column, e.g. "year" spans the range (2019, 2040) and "con" holds multiple countries.
import pandas as pd

d = {'year': [2019, 2020, 2019, 2020, 2019, 2020],
     'age': [10, 10, 20, 20, 30, 30],
     'con': ['UK', 'UK', 'UK', 'US', 'US', 'US'],
     'population': [1, 2, 300, 400, 1000, 2000]}
df = pd.DataFrame(data=d)
df

   year  age con  population
0  2019   10  UK           1
1  2020   10  UK           2
2  2019   20  UK         300
3  2020   20  US         400
4  2019   30  US        1000
5  2020   30  US        2000
output required:

   year  age    con       population
   2019  10     UK        1
   2020  10     UK        2
   2019  10     UK        300
   2020  20     US        400
   2019  20     US        1000
   2020  20     US        2000
   2019  10-20  UK child  301    # addition of row 1 + row 3
   2020  10-20  UK child  402    # addition of 1 + 2
   2019  20-30  UK teen   1000 + age-30 population
I am looking for a loop I can apply over the con column.
This is what I tried (it FAILED!):
variable_list = ['UK', 'US']
ranges = [[0, 10], [10, 20], [20, 30]]
categories = ["Child", "teen", "work"]
year = [x for x in range(2019, 2022)]
q = df  # df.loc[(df["Kategorie 1"].str.strip()==BASE)]
q["age2"] = pd.to_numeric(q["age"])
sums_years = {}
for variable in variable_list:
    c = 0
    u = q.loc[q["cat2"] == variable]
    for r in ranges:
        cat = "Germany: " + categories[c]
        for year in date:
            group = str(r[0]) + '-' + str(r[1])
            n = variable + "_" + group
            if n not in sums_years:
                sums_years[n] = {}
            s = u.loc[(u['year'] == year) & (u["age"] >= r[0]) & (u["age"] <= r[1]), 'population'].sum()
I also tried something like this for a single condition:
df_uk = df[df.con == 'UK'].reset_index(drop=True)
div = ['child', 'teen', 'working']
c = [div[i] for i in range(len(df_uk))]  # list to get element from div
y = [i + 2018 for i in range(1, len(df_uk) + 1)]  # list of 2019, 2020, 2021
x = [[[0, 10], [10, 20], [20, 30]] for i in range(1, len(df_uk) + 1)]
d = {'year': y, 'age': x, 'con': c, 'population': (df_uk['value'] + #adds_something).values}
df_new = pd.DataFrame(data=d)
df = pd.concat([df, df_new], ignore_index=True)
Sorry if it's a mess. I've asked people but got no help, and I am sure there is an easier and better loop. Please help!
Is there a better way to melt the dataframe and do all the calculation, or to restructure the dataframe?
import numpy as np
import pandas as pd

d = {'year': [2019, 2020, 2021, 2020, 2019, 2021],
     'age': [10, 20, 30, 10, 20, 30],
     'con': ['UK', 'UK', 'UK', 'US', 'US', 'US'],
     'population': [1, 2, 300, 400, 1000, 2000]}
df = pd.DataFrame(data=d)

df2 = df.copy()
criteria = [df2['age'].between(0, 10),
            df2['age'].between(11, 20),
            df2['age'].between(21, 30)]
values = ['child', 'teen', 'work']
df2['con'] = df2['con'] + '_' + np.select(criteria, values, 0)

df2['population'] = (df.groupby(['con', 'age']).sum()
                       .groupby(level=0).cumsum()
                       .reset_index()['population'])

final = pd.concat([df, df2])
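A related sketch for the age-band sums: pd.cut can label each row's band in one shot, after which a single groupby sum builds the aggregate rows. The band edges and labels here are assumptions read off the required output:

```python
import pandas as pd

d = {'year': [2019, 2020, 2021, 2020, 2019, 2021],
     'age': [10, 20, 30, 10, 20, 30],
     'con': ['UK', 'UK', 'UK', 'US', 'US', 'US'],
     'population': [1, 2, 300, 400, 1000, 2000]}
df = pd.DataFrame(d)

# label each row's age band: (0,10] child, (10,20] teen, (20,30] work
df['band'] = pd.cut(df['age'], bins=[0, 10, 20, 30], labels=['child', 'teen', 'work'])

# one summed row per (year, con, band)
agg = df.groupby(['year', 'con', 'band'], observed=True, as_index=False)['population'].sum()
```

The agg frame can then be concatenated back onto df if the appended-rows layout is needed.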
I'm trying to calculate a weighted average for multiple columns in a dataframe.
This is a sample of my data
Group  Year  Month  Weight(kg)  Nitrogen  Calcium
A      2020  01     10000       10        70
A      2020  01     15000       4         78
A      2021  05     12000       5         66
A      2021  05     10000       8         54
B      2021  08     14000       10        90
C      2021  08     50000       20        92
C      2021  08     40000       10        95
My desired result would look something like this:
What I've tried:
I can get the correct weighted average values for a single column using this function:
(similar to: link)
def wavg(df, value, weight):
    d = df[value]
    w = df[weight]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()
I can apply this function to a single column of my df:
df2 = df.groupby(["Group", "year", "month"]).apply(wavg, "Calcium", "Weight(kg)").to_frame()
(Don't mind the different values, they are correct for the data in my notebook)
The obvious problem is that this function only works for a single column, whilst I have dozens of columns. I therefore tried a for loop:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg)").to_frame())
It calculates the values correctly, but the columns are placed on top of each other instead of next to each other. They also lack a useful column name:
How could I adapt my code to return the desired df?
Change the function to work on multiple columns, and to avoid losing the grouping columns, convert them to a MultiIndex first:
def wavg(x, value, weight):
    d = x[value]
    w = x[weight]
    try:
        return d.mul(w, axis=0).div(w.sum())
    except ZeroDivisionError:
        return d.mean()

# columns used for groupby
groups = ["Group", "Year", "Month"]
# process all the other columns
cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)

# create the index and process all columns in cols
df1 = (df.set_index(groups)
         .groupby(level=groups)
         .apply(wavg, cols, "Weight(kg)")
         .reset_index())
print(df1)
Group Year Month Calcium Nitrogen
0 A 2020 1 28.000000 4.000000
1 A 2020 1 46.800000 2.400000
2 A 2021 5 36.000000 2.727273
3 A 2021 5 24.545455 3.636364
4 B 2021 8 90.000000 10.000000
5 C 2021 8 51.111111 11.111111
6 C 2021 8 42.222222 4.444444
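If the goal is one aggregated row per group rather than per-row contributions, np.average accepts a weights argument and handles several columns at once. A sketch on the question's sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Group':      ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
    'Year':       [2020, 2020, 2021, 2021, 2021, 2021, 2021],
    'Month':      ['01', '01', '05', '05', '08', '08', '08'],
    'Weight(kg)': [10000, 15000, 12000, 10000, 14000, 50000, 40000],
    'Nitrogen':   [10, 4, 5, 8, 10, 20, 10],
    'Calcium':    [70, 78, 66, 54, 90, 92, 95],
})

cols = ['Nitrogen', 'Calcium']

# weighted mean of every value column within each (Group, Year, Month)
out = (df.groupby(['Group', 'Year', 'Month'])
         .apply(lambda g: pd.Series(np.average(g[cols], weights=g['Weight(kg)'], axis=0),
                                    index=cols))
         .reset_index())
```

For Group A in 2020-01 this gives Nitrogen (10*10000 + 4*15000) / 25000 = 6.4 and Calcium 74.8.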
Try via concat() and reset_index():
df = pd.concat(column_list, axis=1).reset_index()
OR
you can make changes here:
column_list = []
for column in df.columns:
    column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg)").reset_index())

# Finally:
df = pd.concat(column_list, axis=1)
How do I exclude rows from the data by condition?
1- Using .loc I selected the part to be removed.
2- The problem is that there are empty rows in "year"; I want to keep all the empty ones and anything < 2020.
I would use !<, but that doesn't work; Python only accepts !=.
# dataframe
cw =
    year  name
    2022  as
    2020  ad
          sd
          sd
    1988  wwe
    1999  we

cw = cw.loc[cw['year']!>'2020']
The problem is the empty fields, it's tricky... I need to keep everything that is NOT > 2020, so I want to keep the empty and the smaller values.
Isn't not greater than n the same as less than or equal to n?
cw = cw.loc[cw['year']!>'2020']
simply becomes
cw = cw.loc[cw['year'] <= '2020']
Negating the query will also work, but it's important that your "year" column be either an int or a timestamp if you want the > operator to behave correctly.
Try something more like this:
import pandas as pd

cw = pd.DataFrame({"year": [2022, 2020, None, None, 1988, 1999],
                   "name": ["as", "ad", "sd", "sd", "wwe", "we"]})
"""
     year name
0  2022.0   as
1  2020.0   ad
2     NaN   sd
3     NaN   sd
4  1988.0  wwe
5  1999.0   we
"""
cw = cw.loc[~(cw["year"] > 2020)]
"""
     year name
1  2020.0   ad
2     NaN   sd
3     NaN   sd
4  1988.0  wwe
5  1999.0   we
"""
I have a dictionary named c whose values are dataframes; each dataframe has 3 columns: 'year', 'month' & 'Tmed'. I want to calculate the monthly mean values of Tmed for each year, so I used:
for i in range(22):
    c[i].groupby(['year', 'month']).mean().reset_index()
This returns
year month Tmed
0 2018 12 14.8
2 2018 12 12.0
3 2018 11 16.1
5 2018 11 9.8
6 2018 11 9.8
9 2018 11 9.3
4425 rows × 3 columns
The index is not as it should be, and for the 11th month of 2018 for example, there should be only one row but as you see the dataframe has more than one.
I tried the code on a single dataframe and it gave the wanted result:
c[3].groupby(['year','month']).mean().reset_index()
year month Tmed
0 1999 9 23.950000
1 1999 10 19.800000
2 1999 11 12.676000
3 1999 12 11.012000
4 2000 1 9.114286
5 2000 2 12.442308
6 2000 3 13.403704
7 2000 4 13.803846
8 2000 5 17.820000
.
.
.
218 2018 6 21.093103
219 2018 7 24.977419
220 2018 8 26.393103
221 2018 9 24.263333
222 2018 10 19.069565
223 2018 11 13.444444
224 2018 12 13.400000
225 rows × 3 columns
I need the for loop because I have many dataframes. I can't figure out the issue; any help would be appreciated.
I don't see a reason why your code should fail. I tried the below and got the required results:
import numpy as np
import pandas as pd

def getRandomDataframe():
    rand_year = pd.DataFrame(np.random.randint(2010, 2011, size=(50, 1)), columns=list('y'))
    rand_month = pd.DataFrame(np.random.randint(1, 13, size=(50, 1)), columns=list('m'))
    rand_value = pd.DataFrame(np.random.randint(0, 100, size=(50, 1)), columns=list('v'))
    df = pd.DataFrame(columns=['year', 'month', 'value'])
    df['year'] = rand_year
    df['month'] = rand_month
    df['value'] = rand_value
    return df

def createDataFrameDictionary():
    _dict = {}
    length = 3
    for i in range(length):
        _dict[i] = getRandomDataframe()
    return _dict

c = createDataFrameDictionary()

for i in range(3):
    c[i] = c[i].groupby(['year', 'month'])['value'].mean().reset_index()

# Check results
print(c[0])
Please check if the year, month combo repeats in different dataframes, which could be the reason for the repeats.
In your scenario, it may be a good idea to collect the groupby.mean results for each dataframe in another dataframe and do a groupby mean again on the new dataframe
Can you try the following:
main_df = pd.DataFrame()
for i in range(22):
    main_df = pd.concat([main_df, c[i].groupby(['year', 'month']).mean().reset_index()])

print(main_df.groupby(['year', 'month']).mean())
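The loop-and-concat can also be collapsed into a single pass: concatenate every frame in the dictionary once, then do one groupby. A sketch with a small two-frame dictionary standing in for the real c:

```python
import pandas as pd

# stand-in for the real dictionary of 22 dataframes
c = {0: pd.DataFrame({'year': [2018, 2018], 'month': [11, 11], 'Tmed': [10.0, 12.0]}),
     1: pd.DataFrame({'year': [2018, 2018], 'month': [11, 12], 'Tmed': [14.0, 8.0]})}

# one concat, then one groupby across every frame at once
all_df = pd.concat(c.values(), ignore_index=True)
monthly = all_df.groupby(['year', 'month'], as_index=False)['Tmed'].mean()
```

Each (year, month) pair then appears exactly once, regardless of how the rows were split across the original frames.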
My dataframe has a month column with values that repeat as Apr, Apr.1, Apr.2 etc. because there is no year column. I added a year column based on the month value using a for loop as shown below, but I'd like to find a more efficient way to do this:
Products['Year'] = '2015'
for i in range(0, len(Products.Month)):
    if '.1' in Products['Month'][i]:
        Products['Year'][i] = '2016'
    elif '.2' in Products['Month'][i]:
        Products['Year'][i] = '2017'
You can use .str to treat the whole column like a string and split at the dot.
Then apply a function that takes the last piece and turns it into a new year value if possible.
Starting dataframe:
Month
0 Apr
1 Apr.1
2 Apr.2
Solution:
def get_year(entry):
    value = 2015
    try:
        value += int(entry[-1])
    finally:
        return str(value)

df['Year'] = df.Month.str.split('.').apply(get_year)
Now df is:
Month Year
0 Apr 2015
1 Apr.1 2016
2 Apr.2 2017
You can use pd.to_numeric after splitting and add 2015, i.e.
df['new'] = pd.to_numeric(df['Month'].str.split('.').str[-1], errors='coerce').fillna(0) + 2015
# Sample DataFrame from Mike Muller
Month Year new
0 Apr 2015 2015.0
1 Apr.1 2016 2016.0
2 Apr.2 2017 2017.0
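Another vectorized variant keeps the result as strings directly: extract the optional numeric suffix with a regex, fill the missing ones with 0, and add to 2015. A sketch on the same three-row sample:

```python
import pandas as pd

df = pd.DataFrame({'Month': ['Apr', 'Apr.1', 'Apr.2']})

# pull the digits after the dot; rows without a suffix become NaN -> 0
suffix = df['Month'].str.extract(r'\.(\d+)$', expand=False).fillna(0).astype(int)
df['Year'] = (2015 + suffix).astype(str)
```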