Two DataFrames have gene and isoform names that are not formatted the same way. I'd like to join them and add the df2 columns gene and isoform for every partial string match between the isoform (df2) and the Name (df1). df2 is a key for the isoforms/genes, where a gene can have many isoforms. In df1, which is essentially the output of a gene-quantification tool (SALMON), the Name field contains both the gene and the isoform. I can't use a regex because the isoforms have variable suffixes, such as ".", "_", "-", and many others.
Another important piece of information is that each df1["Name"] cell has a unique isoform.
Sample of the DataFrames to merge:
import pandas as pd
df1 = pd.DataFrame({'Name': {0: 'AT1G01010;AT1G01010.1;Isoseq::Chr1:3616-5846', 1: 'AT1G01010;AT1G01010_2;Isoseq::Chr1:3630-5894', 2: 'AT1G01010;AT1G01010.3;Isoseq::Chr1:3635-5849', 3: 'AT1G01020;AT1G01020.11;Isoseq::Chr1:6803-8713', 4: 'AT1G01020;AT1G01020.13;Isoseq::Chr1:6811-8713'}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
df2 = pd.DataFrame({'gene': {0: 'AT1G01010', 14: 'AT1G01010', 30: 'AT1G01010', 46: 'AT1G01020', 62: 'AT1G01020', 80: 'AT1G01020', 100: 'AT1G01020', 116: 'AT1G01020', 138: 'AT1G01020', 156: 'AT1G01020'}, 'isoform': {0: 'AT1G01010.1', 14: 'AT1G01010_2', 30: 'AT1G01010.3', 46: 'AT1G01020.1', 62: 'AT1G01020.10', 80: 'AT1G01020.11', 100: 'AT1G01020.12', 116: 'AT1G01020.13', 138: 'AT1G01020.14', 156: 'AT1G01020.15'}})
display(df1)
display(df2)
Desired output:
df3 = pd.DataFrame({'gene': {0: 'AT1G01010', 1:"AT1G01010", 2:"AT1G01010", 3:"AT1G01020", 4:"AT1G01020"},'isoform': {0: 'AT1G01010.1',1:"AT1G01010_2", 2:"AT1G01010.3", 3:"AT1G01020.11", 4:"AT1G01020.13"}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
# The "Name" column from df1 is no longer needed (the idea is to replace it with gene and isoform).
display(df3)
Real dfs size:
df1 = 143646 rows × 5 columns
df2 = 169499 rows × 2 columns
(since df1 may not have all the isoforms detected, it's always smaller than df2)
I tried some answers I found online, but since these DataFrames are huge, many of them need 50 GB of RAM or so...
Already checked: Merge Dataframes Based on Partial Substrings Match, Join to Dataframes based on partial string matches in python, Join dataframes based on partial string-match between columns
Thanks for the help!
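One possible way to avoid a partial-match join altogether is sketched below. It rests on an assumption taken from the sample only: the isoform is always the second ';'-separated field of Name. If that holds for the real data, the isoform can be extracted once and used as an exact merge key, which should keep memory use far below a pairwise substring comparison. The name df1_keyed is made up for this illustration, and the data is a shortened copy of the sample.

```python
import pandas as pd

# Shortened copies of the sample frames from the question.
df1 = pd.DataFrame({
    'Name': ['AT1G01010;AT1G01010.1;Isoseq::Chr1:3616-5846',
             'AT1G01010;AT1G01010_2;Isoseq::Chr1:3630-5894',
             'AT1G01020;AT1G01020.11;Isoseq::Chr1:6803-8713'],
    'TPM': [2.997776, 1.58178, 4.317311],
})
df2 = pd.DataFrame({
    'gene': ['AT1G01010', 'AT1G01010', 'AT1G01020', 'AT1G01020'],
    'isoform': ['AT1G01010.1', 'AT1G01010_2', 'AT1G01020.1', 'AT1G01020.11'],
})

# Assumption: the isoform is always the second ';'-separated field of Name.
df1_keyed = (
    df1.assign(isoform=df1['Name'].str.split(';').str[1])
       .drop(columns='Name')
)

# Exact-key merge; only isoforms present in both frames are kept.
df3 = df2.merge(df1_keyed, on='isoform', how='inner')
print(df3)
```

Because the key is matched exactly, 'AT1G01020.1' is not confused with 'AT1G01020.11', which a plain substring test would do.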
I am working on an optimization problem and need to create indexing to build a mixed-integer mathematical model. I am using python dictionaries for the task. Below is a sample of my dataset. Full dataset is expected to have about 400K rows if that matters.
# sample input data
pd.DataFrame.from_dict({'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}})
The data frame looks like this -
I am looking to generate a dictionary from each row of this dataframe where the keys and values are tuples made of multiple column values. The output would look like this -
(origin, dest, product, ship_date): (origin, dest, product, truck_in)
# for example, first two rows will become a dictionary key-value pair like
{('perris', 'alexandria', 'bike', '2/25/2022'): ('perris', 'alexandria', 'bike', '3/1/2022'),
('perris', 'alexandria', 'bike', '2/26/2022'): ('perris', 'alexandria', 'bike', '3/2/2022')}
I am very new to python and couldn't figure out how to do this. Any help is appreciated. Thanks!
You can loop through the DataFrame.
Assuming your DataFrame is called "df" this gives you the dict.
result_dict = {}
for idx, row in df.iterrows():
    result_dict[(row.origin, row.dest, row['product'], row.ship_date)] = (
        row.origin, row.dest, row['product'], row.truck_in)
Since looping through 400k rows will take some time, have a look at tqdm (https://tqdm.github.io/) to get a progress bar with a time estimate that quickly tells you if the approach works for your dataset.
Also, note that 400K dictionary entries may take up a lot of memory, so it is worth estimating whether the dict will fit in memory.
Another way, which uses more memory but is faster, is to do it in pandas.
Create a new column with the value for the dictionary:
df['value'] = df.apply(lambda x: (x.origin, x.dest, x['product'], x.truck_in), axis=1)
Then set the index and convert to dict
df.set_index(['origin','dest','product','ship_date'])['value'].to_dict()
The approach below splits the initial dataframe into two dataframes that will be the source of the keys and values in the dictionary. These are then converted to arrays in order to get away from working with dataframes as soon as possible. The arrays are converted to tuples and zipped together to create the key:value pairs.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(
{'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}}
)
#display(df)
#desired output: (origin, dest, product, ship_date): (origin, dest, product, truck_in)
#slice df to key/value chunks and convert each chunk to a record array
#to_records(index=False) keeps only the selected columns, so no set_index on a slice is needed
keys_array = df[['origin', 'dest', 'product', 'ship_date']].to_records(index=False)
values_array = df[['origin', 'dest', 'product', 'truck_in']].to_records(index=False)
#array_of_tuples = map(tuple, an_array)
keys_map = map(tuple, keys_array)
values_map = map(tuple, values_array)
#tuple_of_tuples = tuple(array_of_tuples)
keys_tuple = tuple(keys_map)
values_tuple = tuple(values_map)
zipp = zip(keys_tuple, values_tuple)
dict2 = dict(zipp)
print(dict2)
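A more compact variant of the same zip idea is sketched below, in case it is useful. It assumes the column names from the question; key_cols, val_cols and result are names made up for this illustration. itertuples(index=False, name=None) yields plain tuples, so no intermediate record arrays are needed.

```python
import pandas as pd

# Two rows of the sample data, restricted to the columns that matter here.
df = pd.DataFrame({
    'origin': ['perris', 'perris'],
    'dest': ['alexandria', 'alexandria'],
    'product': ['bike', 'bike'],
    'ship_date': ['02/25/2022', '02/26/2022'],
    'truck_in': ['03/01/2022', '03/02/2022'],
})

key_cols = ['origin', 'dest', 'product', 'ship_date']
val_cols = ['origin', 'dest', 'product', 'truck_in']

# name=None makes itertuples yield plain tuples instead of namedtuples.
keys = df[key_cols].itertuples(index=False, name=None)
values = df[val_cols].itertuples(index=False, name=None)

# Pair each key tuple with the value tuple from the same row.
result = dict(zip(keys, values))
print(result)
```

If two rows share the same key tuple, the later row overwrites the earlier one, as with any dict.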
I'm using Python 3 and currently playing with the latest version of Bokeh.
I've imported everything necessary, but I'm a little bit stuck with a single (I hope) line of code.
I'm using the US County sample data. I want to hover over the map and for it to display the vote percentages for each respective county, as they're hovered over with the cursor.
I've searched on here for other bokeh examples and explicitly the US County data but I can only seem to find questions regarding issues with map shape.
from bokeh.models import LogColorMapper
from bokeh.palettes import Viridis6 as palette
from bokeh.sampledata.us_counties import data as counties
palette = tuple(reversed(palette))
color_mapper = LogColorMapper(palette=palette)
counties = {
code: county for code, county in counties.items() if county['state'] == 'tx'
}
county_xs = [county['lons'] for county in counties.values()]
county_ys = [county['lats'] for county in counties.values()]
county_names = [county['name'] for county in counties.values()]
## Below is the variable I wish to create, and these are the columns and dataframe of importance.
#county_vote_total =
#texasJbFinal['County Vote Percentage'] - where the vote percentages are
#texasJbFinal['County'] - What my own df county column is labelled as.
data = dict(
x=county_xs,
y=county_ys,
name=county_names,
voteP=county_vote_total
)
TOOLS = "pan,wheel_zoom,reset,hover,save"
p = figure(
title='Joe Biden Texas Vote Percentage',
tools=TOOLS,
x_axis_location=None, y_axis_location=None,
tooltips=[
("Name", "@name"), ("Vote Percentage", "@voteP"), ("Long, lat", "($x, $y)")
]
)
p.grid.grid_line_color=None
p.hover.point_policy = "follow_mouse"
p.patches("x", "y", source=data, fill_color={"field": "voteP", "transform": color_mapper},
fill_alpha=0.6, line_color="black", line_width=0.5)
show(p)
I have tried a few things but I can't seem to figure out how to match up each individual county from my texasJbFinal dataframe with the bokeh.sampledata.us_counties and then display the vote percentage as each is hovered over.
Here is a sample of my DF, using texasJbFinal.head(5).to_dict()
{'State': {0: 'Texas', 1: 'Texas', 2: 'Texas', 3: 'Texas', 4: 'Texas'},
'County': {0: 'Roberts County',
1: 'Borden County',
2: 'King County',
3: 'Glasscock County',
4: 'Armstrong County'},
'Candidate': {0: 'Joe Biden',
1: 'Joe Biden',
2: 'Joe Biden',
3: 'Joe Biden',
4: 'Joe Biden'},
'Total Votes': {0: 17, 1: 16, 2: 8, 3: 39, 4: 75},
'County Vote Percentage': {0: 3.091, 1: 3.846, 2: 5.031, 3: 5.972, 4: 6.745},
'Total Population': {0: 912, 1: 697, 2: 315, 3: 2171, 4: 2122},
'White Alone': {0: 782, 1: 598, 2: 234, 3: 1003, 4: 1833},
'White Alone Percent': {0: 85.74561403508771,
1: 85.79626972740316,
2: 74.28571428571428,
3: 46.19990787655458,
4: 86.38077285579642},
'Black or African American Alone': {0: 0, 1: 0, 2: 0, 3: 0, 4: 5},
'Black or African American Alone Percent': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.23562676720075398},
'American Indian and Alaska Native Alone': {0: 0, 1: 0, 2: 0, 3: 0, 4: 22},
'Asian Alone': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'Native Hawaiian and Other Pacific Islander Alone': {0: 0,
1: 0,
2: 0,
3: 0,
4: 0},
'Some other race Alone': {0: 0, 1: 0, 2: 3, 3: 507, 4: 42},
'Two or more races': {0: 23, 1: 15, 2: 0, 3: 0, 4: 71},
'Hispanic or Latino Alone': {0: 107, 1: 84, 2: 78, 3: 661, 4: 149},
'Hispanic or Latino Alone Percent': {0: 11.732456140350877,
1: 12.051649928263988,
2: 24.76190476190476,
3: 30.446798710271764,
4: 7.021677662582469}}
Here's how I'd tackle it:
Turn the Bokeh counties data into a DataFrame to merge with your existing df. Something like:
bokeh_counties = pd.DataFrame.from_records([county for key, county in counties.items()])
...and then you'd have to do some regex matching or other text manipulation to merge, since your values are all appended with " County" and those in the Bokeh dataset are not.
Once you've got the merged DataFrame with all the data you need, convert to a ColumnDataSource for use by the Bokeh glyphs and hovertool. While CDSes aren't absolutely required for a lot of Bokeh tasks, they tend to make things much easier.
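The steps above could look roughly like the sketch below. To keep it self-contained, counties here is a small hand-made stand-in shaped like the Bokeh sample data (the key tuples and coordinates are invented), and texasJbFinal holds only two columns from the question. The final dict has the same keys as the data dict in the question and could be passed to the glyphs directly or wrapped in a ColumnDataSource.

```python
import pandas as pd

# Hypothetical stand-in shaped like bokeh.sampledata.us_counties.data;
# keys and coordinates are made up for illustration.
counties = {
    (48, 1): {'name': 'Roberts', 'state': 'tx',
              'lons': [-100.5, -100.0, -100.0], 'lats': [35.6, 35.6, 36.0]},
    (48, 2): {'name': 'Borden', 'state': 'tx',
              'lons': [-101.7, -101.2, -101.2], 'lats': [32.5, 32.5, 32.9]},
}
texasJbFinal = pd.DataFrame({
    'County': ['Roberts County', 'Borden County'],
    'County Vote Percentage': [3.091, 3.846],
})

# Step 1: one row per county from the Bokeh-style dict.
bokeh_counties = pd.DataFrame.from_records(list(counties.values()))

# Step 2: strip the trailing " County" so the names line up.
votes = texasJbFinal.assign(
    name=texasJbFinal['County'].str.replace(r'\s+County$', '', regex=True)
)

# Step 3: a left merge keeps every county shape, even without vote data.
merged = bokeh_counties.merge(
    votes[['name', 'County Vote Percentage']], on='name', how='left'
)

# Step 4: column-oriented dict, usable as a Bokeh data source.
data = dict(
    x=merged['lons'].tolist(),
    y=merged['lats'].tolist(),
    name=merged['name'].tolist(),
    voteP=merged['County Vote Percentage'].tolist(),
)
print(data['name'], data['voteP'])
```

Counties whose names differ in other ways (spelling, punctuation) would end up with a missing voteP after the left merge, so it is worth checking for NaN values before plotting.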
Thanks for the help. I didn't quite go your route, but it gave me inspiration to solve my issue.
I turned the counties dictionary into a dataframe, did a little text manipulation, merged it with my original pandas dataframe, turned it all back into one dictionary, and everything became very simple after that.
Thanks again for the great answer :)
This seems to have been similarly answered, but I can't get it to work.
I have a pandas DataFrame that looks like sig_vars below. This df has a VAF and a Background column. I would like to use the ztest function from statsmodels to assign a p-value to a new p-value column.
The p-value is calculated something like this for each row:
from statsmodels.stats.weightstats import ztest
p_value = ztest(sig_vars.Background,value=sig_vars.VAF)[1]
I have tried something like this, but I can't quite get it to work:
def calc(x):
    return ztest(x.Background, value=x.VAF.astype(float))[1]
sig_vars.dropna().assign(pval = lambda x: calc(x)).head()
It seems strange to me that this works just fine however:
def calc(x):
    return ztest([0.0001,0.0002,0.0001], value=x.VAF.astype(float))[1]
sig_vars.dropna().assign(pval = lambda x: calc(x)).head()
Here is my DataFrame sig_vars:
sig_vars = pd.DataFrame({'AO': {0: 4.0, 1: 16.0, 2: 12.0, 3: 19.0, 4: 2.0},
'Background': {0: nan,
1: [0.00018832391713747646, 0.0002114408734430263, 0.000247843759294141],
2: nan,
3: [0.00023965141612200435,
0.00018864365214110544,
0.00036566589684372596,
0.0005452562704471102],
4: [0.00017349063150589867]},
'Change': {0: 'T>A', 1: 'T>C', 2: 'T>A', 3: 'T>C', 4: 'C>A'},
'Chrom': {0: 'chr1', 1: 'chr1', 2: 'chr1', 3: 'chr1', 4: 'chr1'},
'ConvChange': {0: 'T>A', 1: 'T>C', 2: 'T>A', 3: 'T>C', 4: 'C>A'},
'DP': {0: 16945.0, 1: 16945.0, 2: 16969.0, 3: 16969.0, 4: 16969.0},
'Downstream': {0: 'NaN', 1: 'NaN', 2: 'NaN', 3: 'NaN', 4: 'NaN'},
'Gene': {0: 'TIIIa', 1: 'TIIIa', 2: 'TIIIa', 3: 'TIIIa', 4: 'TIIIa'},
'ID': {0: '86.fastq/onlyProbedRegions.vcf',
1: '86.fastq/onlyProbedRegions.vcf',
2: '86.fastq/onlyProbedRegions.vcf',
3: '86.fastq/onlyProbedRegions.vcf',
4: '86.fastq/onlyProbedRegions.vcf'},
'Individual': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'IntEx': {0: 'TIII', 1: 'TIII', 2: 'TIII', 3: 'TIII', 4: 'TIII'},
'Loc': {0: 115227854, 1: 115227854, 2: 115227855, 3: 115227855, 4: 115227856},
'Upstream': {0: 'NaN', 1: 'NaN', 2: 'NaN', 3: 'NaN', 4: 'NaN'},
'VAF': {0: 0.00023605783416937148,
1: 0.0009442313366774859,
2: 0.0007071719017031057,
3: 0.0011196888443632507,
4: 0.00011786198361718427},
'Var': {0: 'A', 1: 'C', 2: 'A', 3: 'C', 4: 'A'},
'WT': {0: 'T', 1: 'T', 2: 'T', 3: 'T', 4: 'C'}})
Try this:
def calc(x):
    return ztest(x['Background'], value=float(x['VAF']))[1]
sig_vars['pval'] = sig_vars.dropna().apply(calc, axis=1)
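The likely reason the assign attempt fails is that a callable passed to assign receives the whole DataFrame, so x.Background is a column of lists rather than one list, whereas apply with axis=1 calls the function once per row. The sketch below illustrates the row-wise pattern on a tiny made-up frame. To stay dependency-free it uses a plain difference from the mean as a stand-in for the ztest call, and row_stat, clean and stat are names invented for this illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same two columns as sig_vars.
sig_vars = pd.DataFrame({
    'VAF': [0.0002, 0.0009, 0.0007],
    'Background': [np.nan, [0.0001, 0.0002, 0.0003], [0.0002, 0.0004]],
})

def row_stat(row):
    # Stand-in for: ztest(row['Background'], value=float(row['VAF']))[1]
    # Here row['Background'] is a single list, not a whole column.
    return float(row['VAF']) - np.mean(row['Background'])

# Drop rows without a background, then compute the statistic row by row.
clean = sig_vars.dropna(subset=['Background'])

# Assignment aligns on the index, so dropped rows get NaN.
sig_vars['stat'] = clean.apply(row_stat, axis=1)
print(sig_vars)
```

Swapping the body of row_stat for the ztest call from the answer gives the per-row p-value; rows with a single background value cannot yield a meaningful standard deviation, so those may need separate handling.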
I am working with movie data and have a dataframe column for movie genre. Currently the column contains a list of movie genres for each movie (as most movies are assigned to multiple genres), but for the purpose of this analysis, I would like to parse the list and create a new dataframe column for each genre. So instead of having genre=['Drama','Thriller'] for a given movie, I would have two columns, something like genre1='Drama' and genre2='Thriller'.
Here is a snippet of my data:
{'color': {0: [u'Color::(Technicolor)'],
1: [u'Color::(Technicolor)'],
2: [u'Color::(Technicolor)'],
3: [u'Color::(Technicolor)'],
4: [u'Black and White']},
'country': {0: [u'USA'],
1: [u'USA'],
2: [u'USA'],
3: [u'USA', u'UK'],
4: [u'USA']},
'genre': {0: [u'Crime', u'Drama'],
1: [u'Crime', u'Drama'],
2: [u'Crime', u'Drama'],
3: [u'Action', u'Crime', u'Drama', u'Thriller'],
4: [u'Crime', u'Drama']},
'language': {0: [u'English'],
1: [u'English', u'Italian', u'Latin'],
2: [u'English', u'Italian', u'Spanish', u'Latin', u'Sicilian'],
3: [u'English', u'Mandarin'],
4: [u'English']},
'rating': {0: 9.3, 1: 9.2, 2: 9.0, 3: 9.0, 4: 8.9},
'runtime': {0: [u'142'],
1: [u'175'],
2: [u'202', u'220::(The Godfather Trilogy 1901-1980 VHS Special Edition)'],
3: [u'152'],
4: [u'96']},
'title': {0: u'The Shawshank Redemption',
1: u'The Godfather',
2: u'The Godfather: Part II',
3: u'The Dark Knight',
4: u'12 Angry Men'},
'votes': {0: 1793199, 1: 1224249, 2: 842044, 3: 1774083, 4: 484061},
'year': {0: 1994, 1: 1972, 2: 1974, 3: 2008, 4: 1957}}
Any help would be greatly appreciated! Thanks!
I think you need the DataFrame constructor with add_prefix, followed by concat to join the result back to the original:
df1 = pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')
df = pd.concat([df.drop('genre',axis=1), df1], axis=1)
Timings:
df = pd.DataFrame(d)
print (df)
#5000 rows
df = pd.concat([df]*1000).reset_index(drop=True)
In [394]: %timeit (pd.concat([df.drop('genre',axis=1), pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')], axis=1))
100 loops, best of 3: 3.4 ms per loop
In [395]: %timeit (pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1))
1 loop, best of 3: 757 ms per loop
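One detail worth noting about the constructor approach: the new frame gets a default 0..n-1 index, so concat with axis=1 only lines up if df also has a default index. A minimal sketch of a variant that passes the original index explicitly is shown below; the two-row frame and the names genres and out are made up for this illustration.

```python
import pandas as pd

# Made-up frame with a non-default index to show the alignment issue.
df = pd.DataFrame({
    'title': ['The Godfather', 'The Dark Knight'],
    'genre': [['Crime', 'Drama'], ['Action', 'Crime', 'Drama', 'Thriller']],
}, index=[10, 20])

# Reuse df's index so the new columns align with the original rows.
genres = pd.DataFrame(df['genre'].tolist(), index=df.index).add_prefix('genre_')

# join aligns on the index; shorter lists are padded with missing values.
out = df.drop(columns='genre').join(genres)
print(out)
```

Movies with fewer genres than the longest list get missing values in the trailing genre columns.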
This should work for you:
pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1)