I have a dataset that looks like this:
country | year | supporting_nation | eco_sup | mil_sup
--------|------|-------------------|---------|--------
Fake    | 1984 | US                | 1       | 1
Fake    | 1984 | SU                | 0       | 1
In this fake example, a nation is playing both sides during the cold war and receiving support from both.
I am reshaping the dataset in two ways:
1. I removed all non-US/SU instances of support, since I am only interested in these two countries.
2. I want to reduce the data to one line per year per country, which means adding US- and SU-specific dummy variables for each support variable.
Like so:
country | year | US_SUP | US_eco_sup | US_mil_sup | SU_SUP | SU_eco_sup | SU_mil_sup
--------|------|--------|------------|------------|--------|------------|-----------
Fake    | 1984 | 1      | 1          | 1          | 1      | 1          | 1
Fake    | 1985 | 1      | 1          | 1          | 1      | 1          | 1
florp   | 1984 | 0      | 0          | 0          | 1      | 1          | 1
florp   | 1985 | 0      | 0          | 0          | 1      | 1          | 1
I added all of the dummies and the US_SUP and SU_SUP columns have been populated with the correct values.
However, I am having trouble with giving the right value to the other variables.
To do so, I wrote the following function:
def get_values(x):
    cols = ['eco_sup', 'mil_sup']
    nation = ''
    if x['SU_SUP'] == 1:
        nation = 'SU_'
    if x['US_SUP'] == 1:
        nation = 'US_'
    support_vars = x[['eco_sup', 'mil_sup']]
    # Since each line contains only one measure of support I can
    # automatically assume that the support_vars are from
    # the correct nation
    support_cols = [nation + x for x in cols]
    x[support_cols] = support_vars
The plan is then to use a df.groupby.agg('max') operation, but I never get to this step because the function above returns 0 for each new dummy column, regardless of the values of the columns in the dataframe.
So in the last table all of the US/SU_mil/eco_sup variables would be 0.
Does anyone know what I am doing wrong / why the columns are getting the wrong value?
I solved my problem by abandoning the .apply function and using this instead (where old is a list of the old variable names)
for index, row in df.iterrows():
    if row['SU_SUP'] == 1:
        nation = 'SU_'
        for col in old:
            df[index: index + 1][nation + col] = int(row[col])
    if row['US_SUP'] == 1:
        nation = 'US_'
        for col in old:
            df[index: index + 1][nation + col] = int(row[col])
This did the trick!
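As an aside, the same reshaping can usually be done without iterrows at all. Below is a minimal sketch (using made-up data shaped like the first table; this is not the code from the post) that pivots straight from the long format to one row per country and year:

import pandas as pd

# Hypothetical long-format input, one row per (country, year, supporting_nation)
df = pd.DataFrame({
    'country': ['Fake', 'Fake'],
    'year': [1984, 1984],
    'supporting_nation': ['US', 'SU'],
    'eco_sup': [1, 0],
    'mil_sup': [1, 1],
})

# Pivot so each (country, year) becomes one row and each supporting
# nation gets its own set of support columns
wide = df.pivot_table(index=['country', 'year'],
                      columns='supporting_nation',
                      values=['eco_sup', 'mil_sup'],
                      aggfunc='max',
                      fill_value=0)

# Flatten the resulting MultiIndex columns into names like US_eco_sup
wide.columns = [f'{nation}_{var}' for var, nation in wide.columns]
print(wide.reset_index())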
I have a data frame with 4 columns; the first column is a counter whose values are in hexadecimal.
Data
counter frequency resistance phase
0 15000.000000 698.617126 -0.745298
1 16000.000000 647.001708 -0.269421
2 17000.000000 649.572265 -0.097540
3 18000.000000 665.282775 0.008724
4 19000.000000 690.836975 -0.011101
5 20000.000000 698.051025 -0.093241
6 21000.000000 737.854003 -0.182556
7 22000.000000 648.586792 -0.125149
8 23000.000000 643.014160 -0.172503
9 24000.000000 634.954223 -0.126519
a 25000.000000 631.901733 -0.122870
b 26000.000000 629.401123 -0.123728
c 27000.000000 629.442016 -0.156490
Expected output
| counter | sampling frequency | time       |
|---------|--------------------|------------|
| 0       | -                  | t0 = 0     |
| 1       | 1                  | t1 = t0+sf |
| 2       | 1                  | t2 = t1+sf |
| 3       | 1                  | t3 = t2+sf |
The time column is the new column added to the original data frame. I want to plot time on the x-axis and frequency, resistance, and phase on the y-axis.
Because the value of each row depends on the value of the previous row, you may have to use a for loop for this problem.
For a constant frequency, you could just calculate it in advance; there is no need to operate on the dataframe:
sampling_freq = 1
df['time'] = [sampling_freq * i for i in range(len(df))]
If you need to operate on the dataframe (say the frequency may change at some point), you can use the following to address each cell by row number and column name. The syntax would be simpler using numbers for both row and column, but I prefer to refer to 'time' rather than to column 2.
df['time'] = np.zeros(len(df))
for i in range(1, len(df)):
df.iloc[i, df.columns.get_loc('time')] = df.iloc[i-1, df.columns.get_loc('time')] + df.iloc[i, df.columns.get_loc('sampling frequency')]
Or, alternatively, resetting the index so you can iterate through consecutive numbers:
df['time'] = np.zeros(len(df))
df = df.reset_index()
for i in range(1, len(df)):
df.loc[i, 'time'] = df.loc[i-1, 'time'] + df.loc[i, 'sampling frequency']
df = df.set_index('counter')
Note that, because your sampling frequency is likely constant in the whole experiment, you could simplify it like:
sampling_freq = 1
df['time'] = np.zeros(len(df))
for i in range(1,len(df)):
df.iloc[i, df.columns.get_loc('time')] = df.iloc[i-1, df.columns.get_loc('time')] + sampling_freq
But it's not going to be better than just creating the time series as in the first example.
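If the sampling frequency really does vary from row to row, a vectorized alternative (a sketch assuming a 'sampling frequency' column like the one above; not part of the original answer) is a cumulative sum, which avoids the explicit loop entirely:

import numpy as np
import pandas as pd

# Hypothetical frame with a per-row sampling frequency; the first row has
# no increment, so its NaN is treated as 0 before taking the running total
df = pd.DataFrame({'sampling frequency': [np.nan, 1, 1, 1]})
df['time'] = df['sampling frequency'].fillna(0).cumsum()
print(df)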
I have a CSV file with several columns, and I want to write code that reads a specific column called 'ARPU average 6 month w/t roaming and discount' and then creates a new column called "Logical" based on numpy.where(). Here is what I have at the moment:
csv_data = pd.read_csv("Results.csv")
data = csv_data[['ARPU average 6 month w/t roaming and discount']]
data = data.to_numpy()
sol = []
for target in data:
    if1 = np.where(data < 0, 1, 0)
    sol.append(if1)
csv_data["Logical"] = [sol].values
csv_data.to_csv('Results2.csv', index=False, header=True)
This loop is built incorrectly and does not work: it does not create a new column with the corresponding value for each row. To make it clear: if the value in the column is bigger than 0, the new column should record "1", otherwise "0". The solution can be done in any way (neither np.where() nor a loop is required).
In case you are wondering what "Results.csv" is: it is a big file of data, and the relevant column is the one named above. The code needs to check whether each value in that column is bigger than 0 and write back 1 or 0 in the new column (as described in the question).
updated answer
import pandas as pd
f1 = pd.read_csv("L1.csv")
f2 = pd.read_csv("L2.csv")
f3 = pd.merge(f1, f2, on ='CTN', how ='inner')
# f3.to_csv("Results.csv") # -> you do not need to save the file to a csv unless you really want to
# csv_data = pd.read_csv("Results.csv") # -> f3 is already saved in memory you do not need to read it again
# data = csv_data[['ARPU average 6 month w/t roaming and discount']] # -> you do not need this line
f3['Logical'] = (f3['ARPU average 6 month w/t roaming and discount']>0).astype(int)
f3.to_csv('Results2.csv', index = False, header=True)
original answer
Generally you do not need to use a loop when using pandas or numpy. Take this sample dataframe: df = pd.DataFrame([0,1,2,3,0,0,0,1], columns=['data'])
You can simply use the boolean values returned (where column is greater than 0 return 1 else return 0) to create a new column.
df['new_col'] = (df['data'] > 0).astype(int)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
or, if you want to use numpy:
df['new_col'] = np.where(df['data']>0, 1, 0)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
We have a large dataset that needs to be modified based on specific criteria.
Here is a sample of the data:
Input
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
1 0 0 1 0 0 1
SampleData1 = pd.DataFrame([[0, 1, 1, 1, 1], [0, 0, 1, 0, 0]],
                           columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ'])
The fields of this data are all formatted 'family.member', and a family may have any number of members. We need to remove all rows of the dataframe which have all 0's for any family.
Simply put, we want to only keep rows of the data that contain at least one member of every family.
We have no reproducible code for this problem because we are unsure of where to start.
We thought about using iterrows() but the documentation says:
#You should **never modify** something you are iterating over.
#This is not guaranteed to work in all cases. Depending on the
#data types, the iterator returns a copy and not a view, and writing
#to it will have no effect.
Other questions on S.O. do not quite solve our problem.
Here is what we want the SampleData to look like after we run it:
Expected output
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
SampleData1 = pd.DataFrame([[0, 1, 1, 1, 0]],
                           columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ'])
Also, could you please explain why we should not modify data we iterate over, given that we do that all the time with for loops, and what the correct way to modify a DataFrame is?
Thanks for the help in advance!
Start from copying df and reformatting its columns into a MultiIndex:
df2 = df.copy()
df2.columns = df.columns.str.split(r'\.', expand=True)
The result is:
BL MI
DB KB RO RA XZ
0 0 1 1 1 0
1 0 0 1 0 0
To generate "family totals", i.e. sums of elements in rows over the top
(0) level of column index, run:
df2.groupby(level=[0], axis=1).sum()
The result is:
BL MI
0 1 2
1 0 1
But actually we want to count zeroes in each row of the above table,
so extend the above code to:
(df2.groupby(level=[0], axis=1).sum() == 0).astype(int).sum(axis=1)
The result is:
0 0
1 1
dtype: int64
meaning:
row with index 0 has no "family zeroes",
row with index 1 has one such zero (for one family).
And to print what we are looking for, run:
df[(df2.groupby(level=[0], axis=1).sum() == 0)\
.astype(int).sum(axis=1) == 0]
i.e. print rows from df, with indices for which the count of
"family zeroes" in df2 is zero.
It's possible to group along axis=1. For each row, check that all families (grouped on the column name before '.') have at least one 1, then slice by this Boolean Series to retain these rows.
m = df.groupby(df.columns.str.split('.').str[0], axis=1).any(1).all(1)
df[m]
# BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
#0 0 1 1 1 0 1
As an illustration, here's what grouping along axis=1 looks like; it partitions the DataFrame by columns.
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1):
print(idx, gp, '\n')
#BL BL.DB BL.KB
#0 0 1
#1 0 0
#MAY MAY.BE
#0 1
#1 1
#MI MI.RO MI.RA MI.XZ
#0 1 1 0
#1 1 0 0
Now it's rather straightforward to find the rows where every one of these groups has at least one non-zero column, by chaining any and all along axis=1 as above.
You basically want to group on families and retain rows where there is one or more member for all families in the row.
One way to do this is to transpose the original dataframe and then split the index on the period, taking the first element which is the family identifier. The columns are the index values in the original dataframe.
We can then group on the families (level=0) and sum the members present in each family for every record (df2.groupby(level=0).sum()). Now we retain the index values that have at least one member in every family (.gt(0).all()). We create a mask from these values and apply it as a boolean index on the original dataframe to get the relevant rows.
df2 = SampleData1.T
df2.index = [idx.split('.')[0] for idx in df2.index]
# >>> df2
# 0 1
# BL 0 0
# BL 1 0
# MI 1 1
# MI 1 0
# MI 0 0
# >>> df2.groupby(level=0).sum()
# 0 1
# BL 1 0
# MI 2 1
mask = df2.groupby(level=0).sum().gt(0).all()
# >>> SampleData1[mask]
#    BL.DB  BL.KB  MI.RO  MI.RA  MI.XZ
# 0      0      1      1      1      0
I wrote a function that calculates the projected population per year based on values in different columns (these columns are not shown for simplicity).
How do I append these rows to the dataframe?
import pandas as pd

data = {
    'state': ['Ohio', 'New York'],
    'year': [2000, 2000],
    'pop': [2.5, 3.6]
}
census = pd.DataFrame(data)

def projected_pop_by_year(s):
    new_census = pd.DataFrame()
    current_pop = census[census['state'] == s]['pop'].values[0]
    current_year = census[census['state'] == s]['year'].values[0]
    i = 0; count = 1
    while (i + 1) <= current_pop:
        projected_pop = None  # some calculations
        data = {
            'state': [s],
            'year': [current_year + count],
            'pop': [projected_pop]
        }
        print(pd.DataFrame(data))
        i += 1; count += 1

projected_pop_by_year("Ohio")
Desired output:
| State | Year | Pop |
|----------|------|-------|
| Ohio | 2000 | 2.5 |
| New York | 2000 | 3.6 |
| Ohio | 2001 | None |
| Ohio | 2002 | None |
I tried declaring a new dataframe outside the function with global new_census and appending the rows with new_census.append(pd.DataFrame(data)). The code I had didn't work. I tried pd.concat. That didn't work. I tried declaring a new dataframe inside the function. That didn't work.
Any help is appreciated.
This works for me:
def projected_pop_by_year(s):
    new_census = pd.DataFrame()
    current_pop = census[census['state'] == s]['pop'].values[0]
    current_year = census[census['state'] == s]['year'].values[0]
    i = 0; count = 1
    my_list = []
    while (i + 1) <= current_pop:
        projected_pop = None  # some calculations
        data = {
            'state': [s],
            'year': [current_year + count],
            'pop': [projected_pop]
        }
        my_list.append(pd.DataFrame(data))
        # print(pd.DataFrame(data))
        i += 1; count += 1
    my_list = pd.concat(my_list)
    print(census.append(pd.DataFrame(my_list)))

projected_pop_by_year("Ohio")
state year pop
0 Ohio 2000 2.5
1 New York 2000 3.6
0 Ohio 2001 None
0 Ohio 2002 None
Explanation: make a list before the while loop and save the output of each iteration by appending to that list. Finally, concat the pieces together and append the result to the original census dataframe.
Hope this helps.
There are several ways of adding rows to a Pandas DataFrame. Once you know how to add a row, you can do it in a while/for loop in a way that matches your requirements. You can find different ways of adding a row to a Pandas DataFrame here:
https://thispointer.com/python-pandas-how-to-add-rows-in-a-dataframe-using-dataframe-append-loc-iloc/
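For instance, two common patterns (a sketch with made-up values, not tied to the census example above) are adding a single row with .loc and stacking many rows with pd.concat:

import pandas as pd

df = pd.DataFrame({'state': ['Ohio'], 'year': [2000], 'pop': [2.5]})

# Add one row by assigning to a new index label with .loc
df.loc[len(df)] = ['New York', 2000, 3.6]

# Add many rows at once by building them separately and concatenating
extra = pd.DataFrame({'state': ['Ohio', 'Ohio'],
                      'year': [2001, 2002],
                      'pop': [None, None]})
df = pd.concat([df, extra], ignore_index=True)
print(df)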
I have a DataFrame, df, that looks like:
ID | TERM | DISC_1
1 | 2003-10 | ECON
1 | 2002-01 | ECON
1 | 2002-10 | ECON
2 | 2003-10 | CHEM
2 | 2004-01 | CHEM
2 | 2004-10 | ENGN
2 | 2005-01 | ENGN
3 | 2001-01 | HISTR
3 | 2002-10 | HISTR
3 | 2002-10 | HISTR
ID is a student ID, TERM is an academic term, and DISC_1 is the discipline of their major. For each student, I’d like to identify the TERM when (and if) they changed DISC_1, and then create a new DataFrame that reports when. Zero indicates they did not change. The output looks like:
ID | Change
1 | 0
2 | 2004-01
3 | 0
My code below works, but it’s very slow. I tried to do this using Groupby, but was unable to. Could someone explain how I might accomplish this task more efficiently?
df = df.sort_values(by=['PIDM', 'TERM'])
c = 0
last_PIDM = 0
last_DISC_1 = 0
change = []
for index, row in df.iterrows():
    c = c + 1
    if c > 1:
        row['change'] = np.where((row['PIDM'] == last_PIDM) & (row['DISC_1'] != last_DISC_1),
                                 row['TERM'], 0)
        last_PIDM = row['PIDM']
        last_DISC_1 = row['DISC_1']
    else:
        row['change'] = 0
    change.append(row['change'])
df['change'] = change
change_terms = df.groupby('PIDM')['change'].max()
Here's a start:
df = df.sort_values(['ID', 'TERM'])
gb = df.groupby('ID').DISC_1
df['Change'] = df.TERM[gb.apply(lambda x: x != x.shift().bfill())]
df.Change = df.Change.fillna(0)
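To get from that per-row Change column to the one-row-per-ID table shown in the question, a plausible final step (a sketch, not part of the original answer) is a groupby max; casting the column to string first avoids comparing the integer fill value 0 against term strings such as '2004-01':

# Make the column uniformly string-typed so max() can compare the fill
# value '0' with term strings; '0' sorts below any real term
df['Change'] = df['Change'].astype(str)

# Latest change term per student; '0' means the student never changed
change_terms = df.groupby('ID')['Change'].max().reset_index()
print(change_terms)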
I've never been a big pandas user, so my solution would involve spitting that df out as a csv, and iterating over each row, while retaining the previous row. If it is properly sorted (first by ID, then by Term date) I might write something like this...
import csv

with open('inputDF.csv', 'r', newline='') as infile:
    with open('outputDF.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        previousline = next(reader)  # grab the first row to compare to the second
        termChange = 0
        for line in reader:
            if line[0] != previousline[0]:  # new ID means print and move on to next person
                writer.writerow([previousline[0], termChange])  # write ID, termChange date to file
                termChange = 0
            elif line[2] != previousline[2]:  # new discipline
                termChange = line[1]  # set term changed date
                # termChange = previousline[1]  # in case you want to rather retain the last date they were in the old discipline
            previousline = line  # store current line as previous and continue loop
        writer.writerow([previousline[0], termChange])  # write out the final person as well