I am pretty new to pandas and trying to learn it. So, any advice would be appreciated :)
This is just a small part of my whole dataframe DF2:
Chromosome_Name
Sequence_Source
Sequence_Feature
Start
End
Strand
Gene_ID
Gene_Name
0
1
ensembl_havana
gene
14363
34806
-
"ENSG00000227232"
"WASH7P"
1
1
havana
gene
89295
138566
-
"ENSG00000238009"
"RP11-34P13.7"
2
1
havana
gene
141474
178862
-
"ENSG00000241860"
"RP11-34P13.13"
3
1
havana
gene
227615
272253
-
"ENSG00000228463"
"AP006222.2"
4
1
ensembl_havana
gene
312720
453948
+
"ENSG00000237094"
"RP4-669L17.10"
These are my conditions:
Condition 1: Reference row's "Start" value <= Other row's "End" value.
Condition 2: Reference row's "End" value >= Other row's "Start" value.
This is what I have done so far:
chromosome_list = ["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y"]
dataFrame = DF2.groupby(["Chromosome_Name"])
for chromosome in chromosome_list:
CHR = dataFrame.get_group(chromosome)
for i in range(0, len(CHR)-1):
for j in range(i+1, len(CHR)):
Overlap_index = DF2[(DF2.loc[i, ["Chromosome_Name"] == chromosome]) & (DF2.loc[i, ["Start"]] <= DF2.loc[j, ["End"]]) & (DF2.loc[i, ["End"]] >= DF2.loc[j, ["Start"]])].index
DF2 = DF2.drop(Overlap_index )
The chromosome_list is all the unique values of column "Chromosome_Name".
Mainly, I want to check for each row that whether the columns ("Start" and "End") values are satisfying the conditions above. I believe I need to iterate a single row (reference row) over the particular rows found in the data frame. However, to achieve this I need to consider the value of the first column "Chromosome_Name".
More specifically, every row in DF2 should be checked according to the conditions stated above but, for example, a row at Chromosome_Name = 5 shouldn't be checked with the row of Chromosome_Name = 12. Therefore, first, I thought that I should split the dataframe using pd.groupby() according to Chromosome_Name then, using these dataframes' indexes, I could manipulate (drop the given rows from) the DF2. However, it did not work :)
P.S. After DF2 is splitted into sub dataframes (according to unique Chromosome_Name), each sub dataframe has different size. e.g. There are 641 rows at Chromosome_Name = X but there are 19342 rows for the Chromosome_Name = 1
If you know how to correct my code or provide me another solution, I would be glad.
Thanks in advance.
I am new to pandas too so I do not want to give you a wrong insight and advices but have you ever thougth of converting Start and End columns to lists. So that you can use if statement if you are not comfortable with pandas but your task is urgent. However, I am aware that converting dataframe into list would be something opposite to the creation of pandas.
Related
I'm working with a large dataset that includes all police stops in my city since 2014. The dataset has millions of rows, but sometimes there are multiple rows for a single stop (so, if the police stopped a group of 4 people, it is included in the database as 4 separate rows even though it's all the same stop). I'm looking to create a new column in the dataset orderInStop, which is a count of how many people were stopped in sequential order. The first person caught up in the stop would have a value of 1, the second person a value of 2, and so on.
To do so, I have used the groupby() function to group all rows that match on time & location, which is the indication that the rows are all part of the same stop. I can manage to create a new column that includes the TOTAL count of the number of people in the stop (so, if there were 4 rows with the same time & location, all four rows have a value of 4 for the new orderInStop variable. But I need the first row in the group to have a value of 1, the second a value of 2, the third 3, and the fourth 4.
Below is my code attempt at iterating through each group I've created to sequentially count each row within each group, but the code doesn't quite work (it populates the entire column rather than each row within the groups). Any help to tweak this code would be much appreciated!
Note: I also tried using logical operators in a for loop, to essentially ask IF the time & location column values match for the current and previous rows, but ran into too many problems with 'the truth values of a Series is ambiguous' errors, so instead I'm trying to use groupby().
Attempt that creates a total count rather than sequential count:
df['order2'] = df.groupby(by=["Date_Time_Occur", "Location"])['orderInStop'].transform('count')
Attempt that fails, to iterate through each row in each group:
df['order3'] = 1
grp = df.groupby(by=["Date_Time_Occur", "Location"])
for name, groups in grp:
count = 1
for row in groups:
df['order3'] = count
count = count + 1
In your example for row in groups iterates over the column names, since groups is a DataFrame.
To iterate over each row you could do
df['order3'] = 1
grp = df.groupby(by=["Date_Time_Occur", "Location"])
for name, groups in grp:
count = 1
for i, row in groups.iterrows(): # i will be index, row a pandas Series
df['order3'] = count
count = count + 1
Note that your solution relies on pandas groupby to preserve row order. This should be the case, see this question, but there is very likely a shorter & safer solution (see fsimonjetz comment for a starting point).
pretty new to python, I have a question regarding dataframes.
So currently what I have is two different dataframes. One of them goes like this:
There is a problem regarding the table format so I am posting an image
The other one goes like this:
A
B
C
1
513
2
513
3
730
What I need to get is, to extract the value from the first dataframe that matches the second dataframe's respected rows. For example:The first row of the second dataframe matches 1 to 513, and the value that corresponds this match in the first dataframe is 1.8. For second row in the second dataframe, the match is 2 to 513, and the value that corresponds in the first dataframe is 2. How can I extract these values in a pythonic way?
My current code is like this:
data = first_dataframe
B = second_dataframe['B'].tolist()
for i in range(954):
for j in B:
output = data.iat[i, j]
print(output)
unfortunately what I'm getting is every single value of the first column of the first dataframe matching with all the values of the first row in the first dataframe. What should I change? Thanks in advance.
DataFrame1:
df1=pd.DataFrame({
1:[0.5,0.4,0.1],
2:[0.6,0.2,0.5],
3:[0.5,0.4,0.7],
513:[1.8,2,5],
}, index=[1,2,3]).stack().to_frame(name="C")
DataFrame2:
df2=pd.DataFrame({
"A":[1,2,3],
"B":[513,513,730],
}, index=[1,2,3])
Merge:
df2.merge(df1,left_on=["A","B"],right_index=True)
Result:
| |a|b|c|
|--|--|--|--|
|1|1|513|1.8|
|2|2|513|2.0|
hope this is what you want?
I am relatively new to Python and Pandas. I have two dataframes, one contains a column of codes separated by a comma - the number of codes in each list can vary and can contain a string such as 'Not Applicable' or a blank. The other is a lookup table of the codes and a value. I want to lookup the value of each individual code in each list and calculate the maximum value within that list. For example ['H302','H304'] would be [18,11] and the maximum value of those two would be 18. I then want to return the maximum value of each list as a new column to df2. If it contains anything else, return blank.
This process was originally written in VBA, I solved the problem there by splitting each set of codes by delimiter to a new column, then dynamically running index/matches against each code to return the value. Then it would calculate the maximum value and delete out all the generated columns. I thought at the time it was a messy way to do it and I don't want to replicate this in the Python version.
I would post what I've tried by I can't figure out how I'd go about this - any help is appreciated!
import pandas as pd
df1 = [['H302',18],
['H312',17],
['H315',16],
['H316',15],
['H319',14],
['H320',13],
['H332',12],
['H304',11]]
df1 = pd.DataFrame(df1, columns=['Code', 'Value'])
df2 = [['H302,H304'],
['H332,H319,H312,H320,H316,H315,H302,H304'],
['H315,H312,H316'],
['H320,H332,H316,H315,H304,H302,H312'],
['H315,H319,H312,H316,H332'],
['H312'],
['Not Applicable'],
['']]
df2 = pd.DataFrame(df2, columns=['Code'])
df3 = []
for i in range(len(df2)):
df3.append(df2['Code'][i].split(","))
max_values = []
for i in range(len(df3)):
for j in range(len(df3[i])):
for index in range(len(df1)):
if df1['Code'][index] == df3[i][j]:
df3[i][j] = df1['Value'][index]
max_values.append(max(df3[i]))
df2["Max Value"] = max_values
First, df2 seems to be defined wrongly (single quotes between comas are required). Also, don't generate a data frame of it since you need to be flexible to have any number of elements.
Second, you would need to define the codes as the index to look for elements in the data frame. So, you would define the data frame as:
df1 = pd.DataFrame(df1, columns=['Code', 'Value']).set_index('Code')
Third, you need to loop through the second list of lists and index the elements you want before calculating the maximum using .loc. Also, you need to filter out the codes that are not in the first data frame.
result = []
for codes in df2:
c = [_ for _ in codes if _ in df1.index]
result.append(df1.loc[c,'Value'].max())
Try:
df2.join(df2['Code'].str.split(',')\
.explode()\
.map(df1.set_index('Code')['Value']).groupby(level=0).max().rename('Value'))
Output:
Code Value
0 H302,H304 18.0
1 H332,H319,H312,H320,H316,H315,H302,H304 18.0
2 H315,H312,H316 17.0
3 H320,H332,H316,H315,H304,H302,H312 18.0
4 H315,H319,H312,H316,H332 17.0
5 H312 17.0
6 Not Applicable NaN
7 NaN
I have a "large" DataFrame table with index being country codes (alpha-3) and columns being years (1900 to 2000) imported via a pd.read_csv(...) [as I understand, these are actually string so I need to pass it as '1945' for example].
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example : 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not use iterations (current implementation is something like this, as you can see, using 2 loops is not optimal, I guess I could get rid of one by using apply(row) )
def spread_values(df):
for idx in df.index:
previous_v = 0
for t_year in range(min_year, max_year):
current_v = df.loc[idx, str(t_year)]
if current_v == 0 and previous_v != 0:
df.loc[idx, str(t_year)] = previous_v
else:
previous_v = current_v
However I am told I should use the apply() function, or vectorisation or list comprehension because it is not optimal?
The apply function however, regardless of the axis, does not allow to dynamically get the index/column (which I need to conditionally update the cell), and I think the core issue I can't make the vec or list options work is because I do not have a finite set of column names but rather a wide range (all examples I see use a handful of named columns...)
What would be the more optimal / more elegant solution here?
OR are DataFrames not suited for my data at all? what should I use instead?
You can use df.replace(to_replace=0, method='ffil). This will fill all zeros in your dataframe (except for zeros occuring at the start of your dataframe) with the previous non-zero value per column.
If you want to do it rowwise unfortunately the .replace() function does not accept an axis argument. But you can transpose your dataframe, replace the zeros and transpose it again: df.T.replace(0, method='ffill').T
I am trying to compare all rows within a group to check if a condition is fulfilled. If the condition is not fulfilled, I set the new column to True, else False. The issue I am having is finding a neat way to compare all rows within each group. I have something that works but will not work where there are a lot of rows in a group.
for i in range(8):
n = -i-1
cond=(((df['age']-df['age'].shift(n))*(df['weight']-df['weight'].shift(n)))<0)&(df['ref']==df['ref'].shift(n))&(df['age']<7)&(df['age'].shift(n)<7)
df['x'+i] = cond.groupby(df['ref']).transform('any')
df.loc[:,'WFA'] = 0
df.loc[(df['x0']==False)&(df['x1']==False)&(df['x2']==False)&(df['x3']==False)&(df['x4']==False)&(df['x5']==False)&(df['x6']==False)&(df['x7']==False),'WFA'] = 1
To iterate through each row, I have created a loop that compares adjacent rows (using shift). Each loop represents the next adjacent row. In effect, I am able to compare all rows within a group where the number of rows within a group is 8 or less. As you can imagine, it becomes pretty cumbersome as the number of rows grows large.
Instead of creating of column for each period in shift, I want to see if any row matches the condition with any other row. Then set the new column 'WFA' True or False.
If anyone is interested, I post the answer to my own question here (although it is very slow):
df.loc[:,'WFA'] = 0
for ref, gref in df.groupby('ref'):
count=0
for r_idx, row in gref.iterrows():
cond = ((((row['age']-gref.loc[gref['age']<7, 'age'])*(row['weight']-gref.loc[gref['age']<7, 'weight']))<0).any())&(row['age']<7)
if cond==False:
count+=1
if count==len(gref):
df.loc[df['ref']==ref, 'WFA'] = 1