I wrote a function that calculates the projected population per year based on values in different columns (these columns are not shown for simplicity).
How do I append these rows to the dataframe?
import pandas as pd

data = {
    'state': ['Ohio', 'New York'],
    'year': [2000, 2000],
    'pop': [2.5, 3.6]
}
census = pd.DataFrame(data)
def projected_pop_by_year(s):
    new_census = pd.DataFrame()
    current_pop = census[census['state'] == s]['pop'].values[0]
    current_year = census[census['state'] == s]['year'].values[0]
    i = 0; count = 1
    while (i + 1) <= current_pop:
        projected_pop = None  # some calculations
        data = {
            'state': [s],
            'year': [current_year + count],
            'pop': [projected_pop]
        }
        print(pd.DataFrame(data))
        i += 1; count += 1

projected_pop_by_year("Ohio")
Desired output:
| State | Year | Pop |
|----------|------|-------|
| Ohio | 2000 | 2.5 |
| New York | 2000 | 3.6 |
| Ohio | 2001 | None |
| Ohio | 2002 | None |
I tried declaring a new dataframe outside the function with global new_census and appending the rows with new_census.append(pd.DataFrame(data)), but that didn't work. I also tried pd.concat, and then declaring a new dataframe inside the function; neither worked.
Any help is appreciated.
This works for me:
def projected_pop_by_year(s):
    new_census = pd.DataFrame()
    current_pop = census[census['state'] == s]['pop'].values[0]
    current_year = census[census['state'] == s]['year'].values[0]
    i = 0; count = 1
    my_list = []
    while (i + 1) <= current_pop:
        projected_pop = None  # some calculations
        data = {
            'state': [s],
            'year': [current_year + count],
            'pop': [projected_pop]
        }
        my_list.append(pd.DataFrame(data))
        # print(pd.DataFrame(data))
        i += 1; count += 1
    my_list = pd.concat(my_list)
    print(census.append(my_list))

projected_pop_by_year("Ohio")
state year pop
0 Ohio 2000 2.5
1 New York 2000 3.6
0 Ohio 2001 None
0 Ohio 2002 None
Explanation: make a list before the while loop and save the output of each iteration by appending to that list. Finally, concat the pieces together and append the result to the original census dataframe.
Hope this helps.
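Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions the last step needs pd.concat instead. A minimal sketch of the same idea:
# on pandas >= 2.0, where DataFrame.append no longer exists,
# replace the final print with:
print(pd.concat([census, my_list]))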
There are several ways of adding rows to a pandas DataFrame. Once you know how to add a row, you can do it in a while/for loop in whatever way matches your requirements. You can find different ways of adding a row to a pandas DataFrame here:
https://thispointer.com/python-pandas-how-to-add-rows-in-a-dataframe-using-dataframe-append-loc-iloc/
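For reference, here is a rough sketch of a couple of the common patterns on a small throwaway frame (pd.concat being the modern replacement for DataFrame.append):
import pandas as pd

df = pd.DataFrame({'state': ['Ohio'], 'year': [2000], 'pop': [2.5]})

# 1) assign a new row to the next positional label with .loc
df.loc[len(df)] = ['Ohio', 2001, None]

# 2) build the new row as its own one-row frame and concatenate
new_row = pd.DataFrame({'state': ['Ohio'], 'year': [2002], 'pop': [None]})
df = pd.concat([df, new_row], ignore_index=True)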
I'm importing an Excel sheet into a dataframe; the sheet has its headers split across two rows:
Colour | NaN | Shape | Mass | NaN
NaN | width | NaN | NaN | Torque
green | 33 | round | 2 | 6
etc
I want to collapse the first two rows into one header:
Colour | width | Shape | Mass | Torque
green | 33 | round | 2 | 6
...
I tried merged_header = df.loc[0].combine_first(df.loc[1])
but I'm not sure how to get that back into the original dataframe.
I've tried:
# drop top 2 rows
df = df.drop(df.index[[0,1]])
# then add the merged one in:
res = pd.concat([merged_header, df], axis=0)
But that just inserts merged_header as a column. I tried some other combinations of merge from this tutorial but without luck.
merged_header.append(df) gives a similar wrong result, and res = df.append(merged_header) is almost right, but the header is at the tail end:
green | 33 | round | 2 | 6
...
Colour | width | Shape | Mass | Torque
To provide more detail this is what I have so far:
df = pd.read_excel(ltro19, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
in case it affects the next step.
Let's use a list comprehension to flatten the MultiIndex column header:
df.columns = [f'{j}' if str(i)=='nan' else f'{i}' for i, j in df.columns]
Output:
['Colour', 'width', 'Shape', 'Mass', 'Torque']
This should work for you:
df.columns = list(df.columns.get_level_values(0))
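Both suggestions above assume the two header rows were read in as a MultiIndex rather than as data rows; with the read call from the question that would look roughly like this (a sketch, reusing ltro19 and skiprows from the question):
# read both header rows as a MultiIndex so df.columns holds (top, bottom) pairs
df = pd.read_excel(ltro19, header=[0, 1], skiprows=9)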
Probably due to my ignorance of the terms, the suggestions above did not lead me directly to a working solution. It seemed I was working with a dataframe
>>> print(type(df))
<class 'pandas.core.frame.DataFrame'>
but, I think, without headers.
This solution worked, although it involved jumping out of the dataframe and into a list to then put it back as the column headers. Inspired by Merging Two Rows (one with a value, the other NaN) in Pandas
df = pd.read_excel(name_of_file, header=None, skiprows=9)
# delete all empty columns & rows
df = df.dropna(axis = 1, how = 'all')
df = df.dropna(axis = 0, how = 'all')
# merge the two headers which are weirdly split over two rows
merged_header = df.loc[0].combine_first(df.loc[1])
# turn that into a list
header_list = merged_header.values.tolist()
# load that list as the new headers for the dataframe
df.columns = header_list
# drop top 2 rows (old split header)
df = df.drop(df.index[[0,1]])
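For what it's worth, the same steps can be written a bit more compactly (a sketch equivalent to the block above):
# merge the split header rows, install them as columns, then drop the old rows
df.columns = df.loc[0].combine_first(df.loc[1]).tolist()
df = df.drop(index=[0, 1])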
I have a dataset that looks like this:
country | year | supporting_nation | eco_sup | mil_sup
------------------------------------------------------------------
Fake 1984 US 1 1
Fake 1984 SU 0 1
In this fake example, a nation is playing both sides during the cold war and receiving support from both.
I am reshaping the dataset in two ways:
I removed all non US / SU instances of support, I am only interested in these two countries
I want to reduce it to 1 line per year per country, meaning that I am adding US / SU specific dummy variables for each variable
Like so:
country | year | US_SUP | US_eco_sup | US_mil_sup | SU_SUP | SU_eco_sup | SU_mil_sup |
------------------------------------------------------------------------------------------
Fake 1984 1 1 1 1 1 1
Fake 1985 1 1 1 1 1 1
florp 1984 0 0 0 1 1 1
florp 1985 0 0 0 1 1 1
I added all of the dummies and the US_SUP and SU_SUP columns have been populated with the correct values.
However, I am having trouble with giving the right value to the other variables.
To do so, I wrote the following function:
def get_values(x):
    cols = ['eco_sup', 'mil_sup']
    nation = ''
    if x['SU_SUP'] == 1:
        nation = 'SU_'
    if x['US_SUP'] == 1:
        nation = 'US_'
    support_vars = x[['eco_sup', 'mil_sup']]
    # Since each line contains only one measure of support I can
    # automatically assume that the support_vars are from
    # the correct nation
    support_cols = [nation + x for x in cols]
    x[support_cols] = support_vars
The plan is then to use a df.groupby.agg('max') operation, but I never get to this step, as the function above returns 0 for each new dummy col regardless of the values of the columns in the dataframe.
So in the last table all of the US/SU_mil/eco_sup variables would be 0.
Does anyone know what I am doing wrong / why the columns are getting the wrong value?
I solved my problem by abandoning the .apply function and using this instead (where old is a list of the old variable names)
for index, row in df.iterrows():
    if row['SU_SUP'] == 1:
        nation = 'SU_'
        for col in old:
            df[index: index + 1][nation + col] = int(row[col])
    if row['US_SUP'] == 1:
        nation = 'US_'
        for col in old:
            df[index: index + 1][nation + col] = int(row[col])
This did the trick!
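For completeness: the original .apply version most likely failed because the row passed to the function is a copy and is never returned, so the assignments are silently discarded. Returning the modified row and assigning the result back should make that approach work too; a sketch using the same column names:
def get_values(x):
    cols = ['eco_sup', 'mil_sup']
    if x['SU_SUP'] == 1:
        for col in cols:
            x['SU_' + col] = x[col]
    if x['US_SUP'] == 1:
        for col in cols:
            x['US_' + col] = x[col]
    return x  # returning the row is what makes the changes stick

df = df.apply(get_values, axis=1)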
I've got a basic dictionary that gives me a count of how many times data shows up. e.g. Adam: 10, Beth: 3, ... , Zack: 1
If I do df = pd.DataFrame([dataDict]).T then the keys from the dictionary become the index of the dataframe and I only have one true column of data. I've looked but I haven't found a way around this, so any help would be appreciated.
Edit: More detail
The dictionary was formed from a count function of another dataframe, e.g. dataDict = df1.Name.value_counts().to_dict()
This is my expected output.
   | Name | Count
---|------|------
 0 | Adam | 10
 1 | Beth | 3
What I'm getting at the moment is this:
     | Count
-----|------
Adam | 10
Beth | 3
Try reset_index:
dataDict = dict(Adam=10, Beth=3, Zack=1)
df = pd.Series(dataDict).rename_axis('Name').reset_index(name='Count')
df
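With the example dictionary above, that should print something like:
   Name  Count
0  Adam     10
1  Beth      3
2  Zack      1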
I have a DataFrame, df, that looks like:
ID | TERM | DISC_1
1 | 2003-10 | ECON
1 | 2002-01 | ECON
1 | 2002-10 | ECON
2 | 2003-10 | CHEM
2 | 2004-01 | CHEM
2 | 2004-10 | ENGN
2 | 2005-01 | ENGN
3 | 2001-01 | HISTR
3 | 2002-10 | HISTR
3 | 2002-10 | HISTR
ID is a student ID, TERM is an academic term, and DISC_1 is the discipline of their major. For each student, I’d like to identify the TERM when (and if) they changed DISC_1, and then create a new DataFrame that reports when. Zero indicates they did not change. The output looks like:
ID | Change
1 | 0
2 | 2004-01
3 | 0
My code below works, but it’s very slow. I tried to do this using Groupby, but was unable to. Could someone explain how I might accomplish this task more efficiently?
import numpy as np

df = df.sort_values(by=['PIDM', 'TERM'])
c = 0
last_PIDM = 0
last_DISC_1 = 0
change = []
for index, row in df.iterrows():
    c = c + 1
    if c > 1:
        row['change'] = np.where((row['PIDM'] == last_PIDM) & (row['DISC_1'] != last_DISC_1), row['TERM'], 0)
        last_PIDM = row['PIDM']
        last_DISC_1 = row['DISC_1']
    else:
        row['change'] = 0
    change.append(row['change'])
df['change'] = change
change_terms = df.groupby('PIDM')['change'].max()
Here's a start:
df = df.sort_values(['ID', 'TERM'])
gb = df.groupby('ID').DISC_1
# within each student, compare DISC_1 against the previous row's value
# (shift().bfill() makes the first row compare against itself), then keep
# TERM only on rows where the discipline differs from the row before
df['Change'] = df.TERM[gb.apply(lambda x: x != x.shift().bfill())]
df.Change = df.Change.fillna(0)
I've never been a big pandas user, so my solution would involve spitting that df out as a CSV and iterating over each row while retaining the previous row. If it is properly sorted (first by ID, then by TERM date) I might write something like this...
import csv

with open('inputDF.csv', 'r', newline='') as infile:
    with open('outputDF.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        previousline = next(reader)  # grab the first row to compare to the second
        termChange = 0
        for line in reader:
            if line[0] != previousline[0]:  # new ID means write and move on to the next person
                writer.writerow([previousline[0], termChange])  # write ID, termChange date
                termChange = 0
            elif line[2] != previousline[2]:  # new discipline
                termChange = line[1]  # set term changed date
                # termChange = previousline[1]  # in case you want to rather retain the last date they were in the old discipline
            previousline = line  # store current line as previous and continue loop
        writer.writerow([previousline[0], termChange])  # write the final person after the loop ends
I have an Impala table that I'd like to query using Ibis. The table looks like the following:
id | timestamp
-------------------
A | 5
A | 7
A | 3
B | 9
B | 5
I'd like to group_by this table according to unique combinations of id and timestamp range. The grouping operation should ultimately produce a single grouped object that I can then apply aggregations on. For example:
group1 conditions: id == A; 4 < timestamp < 11
group2 conditions: id == A; 1 < timestamp < 6
group3 conditions: id == B; 4 < timestamp < 7
yielding a grouped object with the following groups:
group1:
id | timestamp
-------------------
A | 5
A | 7
group2:
id | timestamp
-------------------
A | 5
A | 3
group3:
id | timestamp
-------------------
B | 5
Once I have the groups I'll perform various aggregations to get my final results. If anybody could help me figure this group_by out it would be greatly appreciated, even a regular pandas expression would be helpful!
So here is an example for groupby (no underscore):
df = pd.DataFrame({"id":["a","b","a","b","c","c"], "timestamp":[1,2,3,4,5,6]})
Create a grouper column for your timestamp:
df["my interval"] = (df["timestamp"] > 3) & (df["timestamp"] < 5)
You also need some data columns, i.e. ones that you do not use for grouping:
df["dummy"] = 1
df.groupby(["id", "my interval"]).agg("count")["dummy"]
Or you can use both:
df["something that I need"] = df["my interval"] & (df["id"] == "b")
df.groupby(["something that I need"]).agg("count")["dummy"]
You might also want to apply integer division to generate time intervals:
df = pd.DataFrame({"id":["a","b","a","b","c","c"], "timestamp":[1,2,13,14,25,26], "sales": [0,4,2,3,6,7]})
epoch = 10
df["my interval"] = epoch * (df["timestamp"] // epoch)
df.groupby(["my interval"]).agg(sum)["sales"]
EDIT:
Your example:
import pandas as pd
A = "A"
B = "B"
df = pd.DataFrame({"id":[A,A,A,B,B], "timestamp":[5,7,3,9,5]})
df["dummy"] = 1
Solution:
grouper = (df["id"] == A) & (4 < df["timestamp"]) & (df["timestamp"] < 11)
df.groupby(grouper).agg(sum)["dummy"]
or better:
df[grouper]["dummy"].sum()