I am relatively new to Python and pandas. I have two dataframes: one contains a column of codes separated by commas - the number of codes in each list can vary, and the list can also contain a string such as 'Not Applicable' or a blank. The other is a lookup table of codes and values. I want to look up the value of each individual code in each list and calculate the maximum value within that list. For example, ['H302','H304'] maps to [18,11], and the maximum of those two is 18. I then want to return the maximum value of each list as a new column on df2. If the list contains anything else, return a blank.
This process was originally written in VBA; there I solved the problem by splitting each set of codes by delimiter into new columns, then dynamically running INDEX/MATCH against each code to return the value. It then calculated the maximum value and deleted all the generated columns. I thought at the time it was a messy way to do it, and I don't want to replicate it in the Python version.
I would post what I've tried, but I can't figure out how I'd go about this - any help is appreciated!
import pandas as pd

df1 = [['H302', 18],
       ['H312', 17],
       ['H315', 16],
       ['H316', 15],
       ['H319', 14],
       ['H320', 13],
       ['H332', 12],
       ['H304', 11]]
df1 = pd.DataFrame(df1, columns=['Code', 'Value'])

df2 = [['H302,H304'],
       ['H332,H319,H312,H320,H316,H315,H302,H304'],
       ['H315,H312,H316'],
       ['H320,H332,H316,H315,H304,H302,H312'],
       ['H315,H319,H312,H316,H332'],
       ['H312'],
       ['Not Applicable'],
       ['']]
df2 = pd.DataFrame(df2, columns=['Code'])

# split each comma-separated string into a list of codes
df3 = []
for i in range(len(df2)):
    df3.append(df2['Code'][i].split(","))

# replace each code with its looked-up value, then take the maximum per row
max_values = []
for i in range(len(df3)):
    for j in range(len(df3[i])):
        for index in range(len(df1)):
            if df1['Code'][index] == df3[i][j]:
                df3[i][j] = df1['Value'][index]
    max_values.append(max(df3[i]))
df2["Max Value"] = max_values
First, df2 seems to be defined incorrectly (single quotes around each code, separated by commas, are required). Also, don't turn it into a data frame, since you need the flexibility to have any number of elements per row.
Second, you would need to define the codes as the index to look up elements in the data frame. So, you would define the data frame as:
df1 = pd.DataFrame(df1, columns=['Code', 'Value']).set_index('Code')
Third, you need to loop through the second list of lists, select the values you want with .loc, and then calculate the maximum. You also need to filter out the codes that are not in the first data frame.
result = []
for codes in df2:
    c = [_ for _ in codes if _ in df1.index]
    result.append(df1.loc[c, 'Value'].max())
Try:
df2.join(df2['Code'].str.split(',')
            .explode()
            .map(df1.set_index('Code')['Value'])
            .groupby(level=0).max()
            .rename('Value'))
Output:
                                       Code  Value
0                                 H302,H304   18.0
1  H332,H319,H312,H320,H316,H315,H302,H304   18.0
2                            H315,H312,H316   17.0
3        H320,H332,H316,H315,H304,H302,H312   18.0
4                  H315,H319,H312,H316,H332   17.0
5                                      H312   17.0
6                            Not Applicable    NaN
7                                               NaN
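If you want a literal blank instead of NaN for rows such as 'Not Applicable' or the empty string (as the question asks), here is a small follow-up sketch - the variable name out is introduced purely for illustration:
out = df2.join(df2['Code'].str.split(',')
                  .explode()
                  .map(df1.set_index('Code')['Value'])
                  .groupby(level=0).max()
                  .rename('Value'))
out['Value'] = out['Value'].fillna('')  # rows with no recognised codes become blank instead of NaN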
I am trying to do something very basic in pandas and failing miserably.
From a high level I am taking ask_size data from my broker who passes the value to me on every tick update.
I can print out the last value easily enough.
All I am trying to do is append each new ask_size to the previous ones, as a new row at the end of a df, so I can do some historical analysis.
def getTickSize():
    askSize_list = []  # empty list
    askSize_list.append(float(ask_size))  # getting askSize and putting it in a list
    datagrab = {'ask_size': askSize_list}  # creating the single column and putting askSize in
    df = pd.DataFrame(datagrab)  # using a pd df
    print(df.tail(10))
I am then calling the function in a different part of my script
However, the output only ever shows the last askSize:
askSize
0 30.0
And never actually appends the real-time data
Clearly I am doing something wrong, but I am at a loss to what.
I have also tried using ignore_index=True in a second df, referencing the first, but no joy:
askSize
0 30.0
1 30.0
I have also tried using for loops, but as there doesn't seem to be anything to iterate over (the data is real-time), I came to a dead end.
(note I will also eventually add a timestamp to each new ask_size as it is appended to the list. So only 2 columns, in the end)
Any help is much appreciated
It seems you are creating a new dataframe on every call, not appending new data.
You could, for example, build a new dataframe containing the new row(s) in the same format and concatenate it onto the existing data frame.
Let's say you already have df created and you want to add one new entry passed as a parameter (if you need more, specify more parameters). Here is a basic example:
'askSize'
1.0
2.0
def append_row(newdata, dataframe):
    row = {'ask_size': [newdata]}
    temp_df = pd.DataFrame(row)
    # merge original dataframe with temp_df
    merged_df = pd.concat([dataframe, temp_df])
    return merged_df
df = append_row("5.1", df) # this will overwrite your original df
'askSize'
1.0
2.0
5.1
You would need to call the function to add a new row (for instance calling it from inside a loop or any other part of the code).
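For example, here is a minimal sketch that reuses the append_row helper above; incoming_ticks is just a made-up stand-in for your real-time feed:
import pandas as pd

df = pd.DataFrame({'ask_size': []})      # start with an empty history
incoming_ticks = [30.0, 30.5, 29.8]      # made-up stand-in for the real-time tick stream

for ask_size in incoming_ticks:
    df = append_row(float(ask_size), df)  # reassign so the history keeps growing

print(df.tail(10))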
You can also use df.append() and other methods, here are some links that could be useful for your use case:
Merge, join, concatenate and compare (Pandas.pydata.org)
Example of using pd.append() (Pandas.pydata.org)
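One caveat worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the pd.concat approach shown above is the one to use.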
I am writing a script to count the percentage of cells that have a specific value. However, when it counts the rows it does not exclude the cells that are NaN. Basically, I do not want the script to count a cell with the value NaN as a row. I have tried everything from != ""
to .isnan
What I'm trying to do is calculate the percentage of cells that have a specific value, which is not possible if the function counts the rows with NaN values.
RELEVANT CODE
df2 = pd.DataFrame(supplier_data_df, columns=['supplier keywords', 'supplier in ocr'])
total_suppliers = df2[(df2["supplier in ocr"] != "") & (df2["supplier keywords"] != "")]
percentilesupplierkeyword = len(supplier_filtered_df)/len(total_suppliers) * 100
print(percentilesupplierkeyword,"% of supplier-keywords have an issue")
Thank you in advance.
I hope you're doing good.
You can either consider dropping the NaN values or excluding them from your dataframe and then perform your following computations.
If you want to drop the NaN values
df2.dropna(inplace=True)
Or you could use the fillna method to fill the NaN values with 0.
df2.fillna(0, inplace=True)
If you want to get the index list of the NaN values:
df2[df2["col1"].isna()].index.tolist()
I am pretty new to pandas and trying to learn it. So, any advice would be appreciated :)
This is just a small part of my whole dataframe DF2:
   Chromosome_Name  Sequence_Source  Sequence_Feature   Start     End  Strand            Gene_ID        Gene_Name
0                1   ensembl_havana              gene   14363   34806       -  "ENSG00000227232"         "WASH7P"
1                1           havana              gene   89295  138566       -  "ENSG00000238009"   "RP11-34P13.7"
2                1           havana              gene  141474  178862       -  "ENSG00000241860"  "RP11-34P13.13"
3                1           havana              gene  227615  272253       -  "ENSG00000228463"     "AP006222.2"
4                1   ensembl_havana              gene  312720  453948       +  "ENSG00000237094"  "RP4-669L17.10"
These are my conditions:
Condition 1: Reference row's "Start" value <= Other row's "End" value.
Condition 2: Reference row's "End" value >= Other row's "Start" value.
This is what I have done so far:
chromosome_list = ["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y"]
dataFrame = DF2.groupby(["Chromosome_Name"])
for chromosome in chromosome_list:
    CHR = dataFrame.get_group(chromosome)
    for i in range(0, len(CHR)-1):
        for j in range(i+1, len(CHR)):
            Overlap_index = DF2[(DF2.loc[i, ["Chromosome_Name"] == chromosome]) & (DF2.loc[i, ["Start"]] <= DF2.loc[j, ["End"]]) & (DF2.loc[i, ["End"]] >= DF2.loc[j, ["Start"]])].index
            DF2 = DF2.drop(Overlap_index)
The chromosome_list is all the unique values of column "Chromosome_Name".
Mainly, I want to check for each row whether the "Start" and "End" column values satisfy the conditions above. I believe I need to compare a single row (the reference row) against the other relevant rows in the data frame. However, to achieve this I need to take into account the value of the first column, "Chromosome_Name".
More specifically, every row in DF2 should be checked according to the conditions stated above but, for example, a row at Chromosome_Name = 5 shouldn't be checked with the row of Chromosome_Name = 12. Therefore, first, I thought that I should split the dataframe using pd.groupby() according to Chromosome_Name then, using these dataframes' indexes, I could manipulate (drop the given rows from) the DF2. However, it did not work :)
P.S. After DF2 is split into sub dataframes (according to unique Chromosome_Name), each sub dataframe has a different size, e.g. there are 641 rows for Chromosome_Name = X but 19342 rows for Chromosome_Name = 1.
If you know how to correct my code or provide me another solution, I would be glad.
Thanks in advance.
I am new to pandas too, so I do not want to give you wrong insights or advice, but have you ever thought of converting the Start and End columns to lists? That way you can use plain if statements if you are not comfortable with pandas but your task is urgent. However, I am aware that converting a dataframe into lists is somewhat the opposite of the point of pandas.
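A minimal sketch of that idea, using the column names from the question and assuming "overlap" means the two conditions above hold for a pair of rows on the same chromosome (the miniature DF2 below is made up only so the sketch runs on its own):
import pandas as pd

# made-up miniature version of DF2, just so the sketch is runnable
DF2 = pd.DataFrame({
    "Chromosome_Name": ["1", "1", "1", "2"],
    "Start":           [100, 150, 500, 100],
    "End":             [200, 250, 600, 200],
})

to_drop = set()
for _, group in DF2.groupby("Chromosome_Name"):
    starts = group["Start"].tolist()   # plain lists, as suggested above
    ends = group["End"].tolist()
    idx = group.index.tolist()
    for i in range(len(idx) - 1):
        for j in range(i + 1, len(idx)):
            # rows i and j are on the same chromosome; check both overlap conditions
            if starts[i] <= ends[j] and ends[i] >= starts[j]:
                to_drop.add(idx[j])    # drop the later row of each overlapping pair

DF2 = DF2.drop(index=to_drop)
print(DF2)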
I am trying to assign values from a column df2['values'] to a column df1['values']. However, values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid using for-loops when working with dataframes. List comprehensions are great, however since I do not have a lot of experience yet, I struggle formulating more complicated code.
How can I approach this problem more efficiently? What key aspects should I think about when iterating over dataframes with conditions?
The code above tends to skip some rows or assigns them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
        date category
0 2015-01-07       f2
1 2015-01-26       f2
2 2015-01-26       f2
3 2015-04-08       f2
4 2015-04-10       f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
               '2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
               '2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
               '2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
               '2011-11-18'],
              dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
  values category
0     01       f1
1     02       f1
2    2.1       f1
3    2.2       f1
4     03       f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import pandas as pd
import numpy as np

df3 = df1.merge(df2, on="category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns, per the OP's comment, we instead use:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and we leave them be otherwise.
common_dates = df1.merge(subset, on="date", how="left")  # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
N.B.: If more than one df1["date"] matches the date range, you'll have to drop some values, otherwise the duplicates will break the approach explained above.
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter out the data points of df1['date'] in the merged dataframe that are not covered by df2['date_range']. Unfortunately, I need more information about the content of df1['date'] and df2['date_range'] to write code here that would do exactly that.
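For the first point on its own, a minimal sketch (the tiny frames below are made up, standing in for df1 and df2 from the question, which both have a 'category' column):
import pandas as pd

# made-up miniature frames standing in for df1 and df2
df1 = pd.DataFrame({"category": ["f1", "f2"],
                    "date": pd.to_datetime(["2015-01-07", "2015-01-26"])})
df2 = pd.DataFrame({"category": ["f2"], "values": ["02"]})

# join rows that share the same category; the date-range filtering would then run over `merged`
merged = df1.merge(df2, on="category", how="inner")
print(merged)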
I have a column in a dataframe called 'CREDIT RATING' for a number of companies across rows. I need to assign a numerical category to ratings from AAA to DDD, from 1 (AAA) down to 0 (DDD). Is there a quick, simple way to do this and basically create a new column where I get numbers from 1 to 0 in steps of 0.1? Thanks!
You could use replace:
df['CREDIT RATING NUMERIC'] = df['CREDIT RATING'].replace({'AAA':1, ... , 'DDD':0})
The easiest way is to simply create a dictionary mapping:
mymap = {"AAA":1.0, "AA":0.9, ... "DDD":0.0}
and then apply it to the dataframe:
df["CREDIT MAPPING"] = df["CREDIT RATING"].replace(mymap)
OK, this was kinda tough with nothing to work with, but here we go:
import pandas as pd
import numpy as np

# First, getting a ratings list acquired from Wikipedia, then setting it into a dataframe to replicate your scenario
ratings = ['AAA', 'AA1', 'AA2', 'AA3', 'A1', 'A2', 'A3', 'BAA1', 'BAA2', 'BAA3', 'BA1', 'BA2', 'BA3', 'B1', 'B2', 'B3', 'CAA', 'CA', 'C', 'C', 'E', 'WR', 'UNSO', 'SD', 'NR']
df_credit_ratings = pd.DataFrame({'Ratings_id': ratings})
df_credit_ratings = pd.concat([df_credit_ratings, df_credit_ratings])  # just to replicate duplicate records

# set() gets the unique values
unique_ratings = set(df_credit_ratings['Ratings_id'])
number_of_ratings = len(unique_ratings)  # counting how many unique ratings there are
number_of_ratings_by_tenth = number_of_ratings / 10  # because from 0 to 1 in steps of 0.1 there are 10 positions

# numpy's arange fills in values over a range (first two numbers) at a given step (third number)
dec = list(np.arange(0.0, number_of_ratings_by_tenth, 0.1))
After this you'll need to match the unique ratings to their weights:
df_ratings_unique = pd.DataFrame({'Ratings_id':list(unique_ratings)}) # list so it gets one value per row
EDIT: as Thomas suggested in another answer's comment, this sort probably won't fit you because it won't reflect the real order of importance of the ratings. So you'll probably need to first create a dataframe with them already in order, with no need to sort.
df_ratings_unique.sort_values(by='Ratings_id', ascending=True, inplace=True)  # sorting so it matches the order of our weights above
Continuing with the solution:
df_ratings_unique['Weight'] = dec  # adding the weights to the DF
df_ratings_unique.set_index('Ratings_id', inplace=True)  # setting the ratings as index to map the values below
# now this is the magic: we create a new column on the original dataframe and map it by `Ratings_id` using our unique dataframe
df_credit_ratings['Weight'] = df_credit_ratings['Ratings_id'].map(df_ratings_unique.Weight)