Create a row in pandas dataframe - python

I am trying to create a new row in my existing pandas dataframe, and the value of the new row should be a computation.
I have a dataframe that looks like the below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
I want to add a row called "Metric" which is the sum of the "LE_St" variable for "Rating" >= 4 and < 6, divided by "LE_St" for "All", i.e. Metric = (0.05+1.77)/10.17
My output dataframe should look like below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
Metric 0.44

I believe your approach to the dataframe is wrong.
Usually rows hold values that correlate with the columns in a way that makes sense, not arbitrary extra information. The power of pandas and Python is in holding and manipulating data: you can easily compute a value from one column, or even all columns, and store it in a "summary"-like dataframe or in separate variables. That might help you here as well.
For computation on a column (i.e. a Series object) you can use the .sum() method (or any of the other computational tools) and slice your dataframe by the values in the "Rating" column.
For one-off computation of small statistics you may be better off with Excel :)
An example of a solution might look like this:
# assumes the "Rating" values used in the comparison are numeric; the "All" row
# would need to be excluded (or kept in a separate summary) beforehand
total = 10.17  # I don't know where this value comes from
sliced_df = df[df['Rating'].between(4, 6, inclusive='left')]  # 'left' => >= 4 and < 6 (pandas >= 1.3)
metric = sliced_df['LE_St'].sum() / total
print(metric)  # or store it somewhere, however you like
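If you also want the result stored as the extra "Metric" row the question asks for, a minimal sketch (assuming the three columns shown, a default integer index, and that the value goes under LE_St; "% Total" has no obvious value here, so it is left as NaN) could be:
import numpy as np

# append a "Metric" row at the end; assumes df still has a default RangeIndex
df.loc[len(df)] = ['Metric', metric, np.nan]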

Related

Percent of total clusters per cluster per season using pandas

I have a pandas DataFrame that looks like this, with 12 clusters in total. Certain clusters don't appear in certain seasons.
I want to create a multi-line graph over the seasons of the percentage of a specific cluster in each season. So if there are 30 teams in the 97-98 season and 10 of them are in Cluster 1, then that value would be .33, since Cluster 1 has one third of the total possible spots.
It'll look like this
And I want the dataset to look like this, where each cluster has its own percentage of the whole number of clusters in that season. I've tried using the pandas groupby method to get a bunch of lists and then use value_counts() on that, but that doesn't work since looping through df.groupby(['SEASON']) returns tuples, not a Series.
Thanks so much
Use .groupby combined with .value_counts and .unstack:
temp_df = df.groupby(['SEASON'])['Cluster'].value_counts(normalize=True).unstack().fillna(0.0)
temp_df.plot()
print(temp_df.round(2))
Cluster 0 1 2 4 5 6 7 10 11
SEASON
1996-97 0.1 0.21 0.17 0.21 0.07 0.1 0.03 0.07 0.03
1997-98 0.2 0.00 0.20 0.20 0.00 0.0 0.20 0.20 0.00

Making arrays of columns (or rows) of a (space-delimited) textfile in Python

I have seen similar questions, but the answers always give strings of rows. I want to make arrays of the columns of a text file. I have a text file like this (the actual file has 106 rows and 19 columns):
O2 CO2 NOx Ash Other
20.9 1.6 0.04 0.0002 0.0
22.0 2.3 0.31 0.0005 0.0
19.86 2.1 0.05 0.0002 0.0
17.06 3.01 0.28 0.006 0.001
I expect to have arrays of the columns (either a 2D array of all columns or a 1D array per column); the first row is only names, so that should become a list. I would like to plot them later.
The desired result would be, for example for one column:
array([0.04,
       0.31,
       0.05,
       0.28], dtype=float32)
and for the first row:
species = ['O2', 'CO2', 'NOx', 'Ash', 'Other']
I'd recommend not manually looping over the values in large data sets (in this case a sort of tab-separated relational table). Just use the methods of a safe and well-known library like NumPy:
import numpy as np
data = np.transpose(np.loadtxt("/path/to/file.txt", skiprows=1, delimiter="\t"))
With the inner loadtxt you read the file, and the skiprows=1 parameter skips the first row (the column names) to avoid incompatible data types and extra conversions. If you need that row in the same structure, just insert a new row at index 0. Then you need to transpose the matrix, for which there is a safe method in NumPy as well. I just used the output of loadtxt (a 2D array with one row per line of the file) as the input of transpose to get a one-liner, but it's better to keep the two steps apart in order to avoid "train wrecks" and to be able to see what happens in between and correct any unwanted results.
PS: the delimiter parameter must be adjusted to match the one in the original file. Check the loadtxt documentation for more info. I considered it to be a TAB. #KostasCharitidis - thanks for your note
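For the space-delimited sample in the question, a minimal sketch along those lines (the path is a placeholder; loadtxt splits on any whitespace by default, so no delimiter argument is needed) might be:
import numpy as np

# read the header line separately to get the species names
with open("/path/to/file.txt") as f:
    species = f.readline().split()

# skiprows=1 skips the header; .T transposes so each row of `data` is one column of the file
data = np.loadtxt("/path/to/file.txt", skiprows=1, dtype=np.float32).T

nox = data[2]  # e.g. the third column (NOx), ready to plot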
UPDATE3
st = open('file.txt', 'r').read()
dct = []
species = []
for row in st.split('\n')[0].split(' '):
    species.append(row)
for no, row in enumerate(st.split('\n')[1:]):
    dct.append([])
    for elem in row.split(' '):
        dct[no].append([float(elem)])
print(species)
print(dct)
RESULT
['O2', 'CO2', 'NOx', 'Ash', 'Other']
[[[20.9], [1.6], [0.04], [0.0002], [0.0]], [[22.0], [2.3], [0.31], [0.0005], [0.0]], [[19.86], [2.1], [0.05], [0.0002], [0.0]], [[17.06], [3.01], [0.28], [0.006], [0.001]]]
file.txt
O2 CO2 NOx Ash Other
20.9 1.6 0.04 0.0002 0.0
22.0 2.3 0.31 0.0005 0.0
19.86 2.1 0.05 0.0002 0.0
17.06 3.01 0.28 0.006 0.001
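Since the rest of these questions use pandas, an alternative sketch with pandas.read_csv may also be worth noting (not from either answer above; sep=r'\s+' assumes whitespace-delimited data):
import pandas as pd

# the header row becomes the column names; sep=r'\s+' splits on runs of whitespace
df = pd.read_csv('file.txt', sep=r'\s+')

species = list(df.columns)                   # ['O2', 'CO2', 'NOx', 'Ash', 'Other']
nox = df['NOx'].to_numpy(dtype='float32')    # 1D array of one column, ready to plot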

Transform Scale of Columns into a range from 1 to 10

I am trying to create a set of new columns derived from existing columns in a dataframe using a function. Here is sample code that produces errors, and I wonder if there is a better, more efficient way to accomplish this than the loop:
import numpy as np
import pandas as pd
dates = pd.date_range('1/1/2000', periods=100, freq='M')
long_df = pd.DataFrame(np.random.randn(100, 4),index=dates, columns=['Colorado', 'Texas', 'New York', 'Ohio'])
mylist=['Colorado', 'Texas', 'New York', 'Ohio']
def trnsfrm_1_10(a, b):
    b = (a-np.min(a))/(np.max(a)-np.min(a))*9+1
    return b

for a in mylist:
    b = a+"_T"
    long_df[b] = long_df.apply(lambda row: trnsfrm_1_10(row[a], row[b]), axis=1)
To clarify the above question, here is an example of a DataFrame that has input columns (Colorado, Texas, New York) and output columns (T_Colorado, T_Texas, T_New York). Let's assume that the values below are the minimum and maximum of each input column; then applying the equation b = (a-min)/(max-min)*9+1 to each column gives the output columns T_Colorado, T_Texas, T_New York. I had to simulate this process in Excel based on just 5 rows, but it would be great to compute the minimum and maximum as part of the function because I will have a lot more rows in the real data. I am relatively new to Python and pandas and I really appreciate your help.
These are example min and max
Colorado Texas New York
min 0.03 -1.26 -1.04
max 1.17 0.37 0.86
This is example of a DataFrame
Index Colorado Texas New York T_Colorado T_Texas T_New York
1/31/2000 0.03 0.37 0.09 1.00 10.00 6.35
2/29/2000 0.4 0.26 -1.04 3.92 9.39 1.00
3/31/2000 0.35 -0.06 -0.75 3.53 7.63 2.37
4/30/2000 1.17 -1.26 -0.61 10.00 1.00 3.04
5/31/2000 0.46 -0.79 0.86 4.39 3.60 10.00
IIUC, you should take advantage of broadcasting
long_df2= (long_df - long_df.min())/(long_df.max() - long_df.min()) * 9 + 1
Then concat
pd.concat([long_df, long_df2.add_suffix('_T')], axis=1)
In your code, the error is that when you define trnsfrm_1_10, b is a parameter while it is actually only your output. It should not be a parameter, especially since it is the value of the new column you want to create in the for loop. So the code would be more like:
def trnsfrm_1_10(a):
    b = (a-np.min(a))/(np.max(a)-np.min(a))*9+1
    return b

for a in mylist:
    b = a+"_T"
    long_df[b] = long_df.apply(lambda row: trnsfrm_1_10(row[a]), axis=1)
The other thing is that you calculate np.min(a) inside trnsfrm_1_10, which will actually be equal to a (same with max), because you apply row-wise, so a is the single value at that row and column. I assume what you mean is more like np.min(long_df[a]), which can also be written long_df[a].min().
If I understand well, what you try to perform is actually:
dates = pd.date_range('1/1/2000', periods=100, freq='M')
long_df = pd.DataFrame(np.random.randn(100, 4),index=dates,
columns=['Colorado', 'Texas', 'New York', 'Ohio'])
mylist=['Colorado', 'Texas', 'New York', 'Ohio']
for a in mylist:
    long_df[a+"_T"] = (long_df[a]-long_df[a].min())/(long_df[a].max()-long_df[a].min())*9+1
giving then:
long_df.head()
Out[29]:
Colorado Texas New York Ohio Colorado_T Texas_T \
2000-01-31 -0.762666 1.413276 0.857333 0.648960 3.192754 7.768111
2000-02-29 0.148023 0.304971 1.954966 0.656787 4.676018 6.082177
2000-03-31 0.531195 1.283100 0.070963 1.098968 5.300102 7.570091
2000-04-30 -0.385679 0.425382 1.330285 0.496238 3.806763 6.265344
2000-05-31 -0.047057 -0.362419 -2.276546 0.297990 4.358285 5.066955
New York_T Ohio_T
2000-01-31 6.390972 5.659870
2000-02-29 8.242445 5.676254
2000-03-31 5.064533 6.601876
2000-04-30 7.188740 5.340175
2000-05-31 1.104787 4.925180
where all the values in the columns ending with _T are calculated from the corresponding original column.
Ultimately, to avoid a for loop over the columns, you can do:
long_df_T = (((long_df - long_df.min(axis=0))/(long_df.max(axis=0) - long_df.min(axis=0))*9 + 1)
             .add_suffix('_T'))
to create a dataframe with all the _T columns at once. Then a few options are available to add them to long_df; one way is with join:
long_df = long_df.join(long_df_T)

Loop through grouped data - Python/Pandas

I'm trying to perform an action on grouped data in Pandas. For each group, I want to loop through the rows and compare them to the first row in the group. If conditions are met, then I want to print out the row details. My data looks like this:
Orig Dest Route Vol Per VolPct
ORD ICN A 2,251 0.64 0.78
ORD ICN B 366 0.97 0.13
ORD ICN C 142 0.14 0.05
DCA FRA A 9,059 0.71 0.85
DCA FRA B 1,348 0.92 0.13
DCA FRA C 281 0.8 0.03
My groups are Orig, Dest pairs. If a row in the group other than the first row has a Per greater than the first row and a VolPct greater than .1, I want to output the grouped pair and the route. In this example, the output would be:
ORD ICN B
DCA FRA B
My attempted code is as follows:
for lane in otp.groupby(otp['Orig','Dest']):
    X = lane.first(['Per'])
    for row in lane:
        if (row['Per'] > X and row['VolPct'] > .1):
            print(row['Orig','Dest','Route'])
However, this isn't working so I'm obviously not doing something right. I'm also not sure how to tell Python to ignore the first row when in the "row in lane" loop. Any ideas? Thanks!
You are pretty close as it is.
First, you are calling groupby incorrectly. You should just pass a list of the column names instead of a DataFrame object. So, instead of otp.groupby(otp['Orig','Dest']) you should use otp.groupby(['Orig','Dest']).
Once you are looping through the groups you will hit more issues. A group in a groupby object is actually a tuple. The first item in that tuple is the grouping key and the second is the grouped data. For example your first group would be the following tuple:
(('DCA', 'FRA'), Orig Dest Route Vol Per VolPct
3 DCA FRA A 9,059 0.71 0.85
4 DCA FRA B 1,348 0.92 0.13
5 DCA FRA C 281 0.80 0.03)
You will need to change the way you set X to reflect this. For example, X = lane.first(['Per']) should become X = lane[1].iloc[0].Per. After that you only have minor errors in the way you iterate through the rows and access multiple columns in a row. To wrap it all up, your loop should look something like this:
for key, lane in otp.groupby(['Orig', 'Dest']):
    X = lane.iloc[0].Per
    for idx, row in lane.iterrows():
        if (row['Per'] > X and row['VolPct'] > .1):
            print(row[['Orig', 'Dest', 'Route']])
Note that I use iterrows to iterate through the rows, and I use double brackets when accessing multiple columns in a DataFrame.
You don't really need to tell pandas to ignore the first row in each group as it should never trigger your if statement, but if you did want to skip it you could use lane[1:].iterrows().
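For what it's worth, a loop-free variant of the same check (a sketch, not from the original answer; it assumes Per and VolPct are already numeric) could use groupby().transform to broadcast each group's first Per back onto its rows:
# first "Per" of each (Orig, Dest) group, aligned back to every row
first_per = otp.groupby(['Orig', 'Dest'])['Per'].transform('first')

# rows whose Per exceeds the group's first Per and whose VolPct is above .1
mask = (otp['Per'] > first_per) & (otp['VolPct'] > .1)
print(otp.loc[mask, ['Orig', 'Dest', 'Route']])
The first row of each group never satisfies Per > first_per, so it is excluded automatically.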

Working with samples, applying a function to a large subset of columns

I have data that consist of 1,000 samples from a distribution of a rate for several different countries stored in a pandas DataFrame:
s1 s2 ... s1000 pop
region country
NA USA 0.25 0.27 0.23 300
CAN 0.16 0.14 0.13 35
LA MEX ...
I need to multiply each sample by the population. To accomplish this, I currently have:
for column in data.filter(regex='sample'):
    data[column] = data[column]*data['pop']
While this works, iterating over columns feels like it goes against the spirit of python and numpy. Is there a more natural way I'm not seeing? I would normally use apply, but I don't know how to use apply and still get the unique population value for each row.
More context: the reason I need to do this multiplication is that I want to aggregate the data by region, collapsing USA and CAN into North America, for example. However, because my data are rates, I cannot simply add; I must multiply by population to turn them into counts.
I might do something like
>>> df
s1 s2 s1000 pop
region country
NaN USA 0.25 0.27 0.23 300
CAN 0.16 0.14 0.13 35
[2 rows x 4 columns]
>>> df.iloc[:,:-1] = df.iloc[:, :-1].mul(df["pop"], axis=0)
>>> df
s1 s2 s1000 pop
region country
NaN USA 75.0 81.0 69.00 300
CAN 5.6 4.9 4.55 35
[2 rows x 4 columns]
where instead of iloc-ing every column except the last you could use any other loc-based filter.
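For instance, a name-based sketch that is equivalent for the frame shown above could be:
# every column except "pop" (or e.g. df.filter(regex='sample').columns, reusing the question's own filter)
sample_cols = df.columns.drop('pop')
df[sample_cols] = df[sample_cols].mul(df['pop'], axis=0)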
