I'm working on some data manipulations with time intervals and have two time formats in a pandas DataFrame. Every first occurrence of the time interval is duplicated ("1:221:22" in the example below), and the second occurrence is in quotation marks and preceded by two commas. How can I clean the data up as efficiently as possible?
From example data:
obs1, 1:221:22,
obs2, ",,1:22"
To:
obs1, 1:22,
obs2, 1:22
First you need a filter to separate how the rows should be treated.
filter_commas = (df[comma_column].str.startswith(",,"))
Then you treat each group based on its format.
# First, remove all the commas at the start
df.loc[filter_commas, column_name] = df.loc[filter_commas, column_name].str.replace(",", "")
Then split the values for the rows that don't start with commas:
# Split the remaining rows in half by string length
df.loc[~filter_commas, column_name] = df.loc[~filter_commas, column_name].apply(lambda row_val: row_val[:len(row_val) // 2])
The code may need tweaking, but this should put you on the right track.
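Putting it together, a rough self-contained sketch of the whole flow on the example values (column_name is just a placeholder here; adjust it to your data):
import pandas as pd

column_name = "time_col"  # placeholder column name
df = pd.DataFrame({column_name: ["1:221:22", ",,1:22"]}, index=["obs1", "obs2"])

# Rows whose value starts with two commas
filter_commas = df[column_name].str.startswith(",,")

# Strip the leading commas from those rows
df.loc[filter_commas, column_name] = df.loc[filter_commas, column_name].str.replace(",", "")

# For the duplicated values, keep only the first half of the string
df.loc[~filter_commas, column_name] = df.loc[~filter_commas, column_name].apply(lambda val: val[:len(val) // 2])

print(df)  # both rows should now read 1:22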
How can I limit the number of rows in a pandas DataFrame in Python? I need to keep only the last 1000 rows and delete the rest.
For example: 1000 rows in the pandas DataFrame -> 1000 rows in the CSV.
I tried df.iloc[:1000]
I need to clean up the DataFrame automatically and save only the last 1000 rows.
If you want the first 1000 records you can use:
df = df.head(1000)
With df.iloc[:1000] you get the first 1000 rows.
Since you want to get the last 1000 rows, you have to change this line a bit to df_last_1000 = df.iloc[-1000:]
To save it as a CSV file you can use pandas' to_csv() method: df_last_1000.to_csv("last_1000.csv")
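Putting both steps together (a minimal sketch, assuming df already holds your data):
df_last_1000 = df.iloc[-1000:]  # or df.tail(1000)
df_last_1000.to_csv("last_1000.csv")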
Are you trying to limit the number of rows when importing a csv, or when exporting a dataframe to a new csv file?
Importing first 1000 rows of csv:
df_limited = pd.read_csv(file, nrows=1000)
Get first 1000 rows of a dataframe (for export):
df_limited = df.head(1000)
Get last 1000 rows of a dataframe (for export):
df_limited = df.tail(1000)
Edit 1
As you are exporting a csv:
You can make a range selection with [n:m], where n is the starting point of your selection and m is the end point (exclusive).
It works like this:
If the number is positive, it's counting from the top of the list, beginning of the string, top of the dataframe etc.
If the number is negative, it counts from the back.
[5:] selects everything from index 5 to the end (as there is no end point given)
[3:8] selects everything from index 3 up to, but not including, index 8
[5:-2] selects everything from index 5 up to, but not including, the 2nd element from the back
[-1000:] the start point is 1000 elements from the back and the end point is the last element (this is what you wanted, I think)
[:1000] selects the first 1000 rows (the start point is the beginning, as there is no number given, and the end point is 1000 elements from the front)
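A quick way to see these rules in action on a plain list (the same slicing applies to df.iloc):
items = list(range(10))  # [0, 1, ..., 9]

print(items[5:])    # [5, 6, 7, 8, 9]  -> from index 5 to the end
print(items[3:8])   # [3, 4, 5, 6, 7]  -> index 3 up to, not including, 8
print(items[5:-2])  # [5, 6, 7]        -> index 5 up to the 2nd from the back
print(items[-3:])   # [7, 8, 9]        -> the last 3 elements
print(items[:3])    # [0, 1, 2]        -> the first 3 elements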
Edit 2
After a quick check (and a very simple benchmark), it looks like df.tail(1000) is significantly faster than df.iloc[-1000:].
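If you want to reproduce that comparison, here is a very simple sketch (actual timings will depend on your data and pandas version):
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 3), columns=["a", "b", "c"])

print(timeit.timeit(lambda: df.tail(1000), number=1000))
print(timeit.timeit(lambda: df.iloc[-1000:], number=1000))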
I'm trying to compare two bioinformatic DataFrames (one with transcription start and end genomic locations, and one with expression data). I need to check if any of a list of locations in one DataFrame is present within ranges defined by the start and end locations in the other DataFrame, returning rows/ids where they match.
I have tried a number of built-in methods (.isin, .where, .query), but usually get stuck because the lists are unhashable. I've also tried a nested for loop with iterrows and itertuples, which is exceedingly slow (my actual datasets are thousands of entries).
tss_df = pd.DataFrame(data={'id':['gene1','gene2'],
'locs':[[21,23],[34,39]]})
exp_df = pd.DataFrame(data={'gene':['geneA','geneB'],
'start': [15,31], 'end': [25,42]})
I'm looking to find that the row with id 'gene1' in tss_df has locations (locs) that match 'geneA' in exp_df.
The output would be something like:
output = pd.DataFrame(data={'id':['gene1','gene2'],
'locs': [[21,23],[34,39]],
'match': ['geneA','geneB']})
Edit: Based on a comment below, I tried playing with merge_asof:
pd.merge_asof(tss_df,exp_df,left_on='locs',right_on='start')
This gave me an incompatible merge keys error, I suspect because I'm comparing a list to an integer; so I split out the first value in locs:
tss_df['loc1'] = tss_df['locs'].str[0]
pd.merge_asof(tss_df,exp_df,left_on='loc1',right_on='start')
This appears to have worked for my test data, but I'll need to try it with my actual data!
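For reference, a self-contained version of that attempt on the example frames (note that merge_asof expects both frames to be sorted on the merge keys, which the sample data already is):
import pandas as pd

tss_df = pd.DataFrame({'id': ['gene1', 'gene2'],
                       'locs': [[21, 23], [34, 39]]})
exp_df = pd.DataFrame({'gene': ['geneA', 'geneB'],
                       'start': [15, 31], 'end': [25, 42]})

# Take the first location of each row so the merge key is a scalar
tss_df['loc1'] = tss_df['locs'].str[0]

# merge_asof matches each loc1 with the closest preceding 'start'
merged = pd.merge_asof(tss_df, exp_df, left_on='loc1', right_on='start')
print(merged)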
Can you please suggest an easy way to convert time periods to the corresponding numerical indexes?
I have a function that picks entries from data frames based on numerical indexes (e.g. from the 10th to the 20th row) that I cannot change. At the same time, my data frame has time indexes, and I have picked parts of it based on timestamps. How do I convert those timestamps to the corresponding numerical indexes?
Thanks a lot
Alex
Adding some examples:
small_df.index[1]
Out[894]: Timestamp('2019-02-08 07:53:33.360000')
small_df.index[10]
Out[895]: Timestamp('2019-02-08 07:54:00.149000')
This is the time period I want to pick from a second data frame that also has time indexing, but I want to do it with numerical indexes instead of timestamps.
That means I need to:
1. Find which numerical indexes correspond to the time period above.
Based on the comment above, this might be quite close to what I need:
start = second_dataframe.index.get_loc(pd.Timestamp(small_df.index[1]))
end = second_dataframe.index.get_loc(pd.Timestamp(small_df.index[10]))
picked_rows = second_dataframe[start:end]
Is there a better way to do that?
I believe you need Index.get_loc if you need the position:
small_df.index.get_loc(pd.Timestamp('2019-02-08 07:53:33.360000'))
1
EDIT: If the values always match, it is possible to get the timestamps from the first DataFrame and extract the rows of the second with DataFrame.loc:
start = small_df.index[1]
end = small_df.index[10]
picked_rows = second_dataframe.loc[start:end]
Or:
start=pd.Timestamp(small_df.index[1])
end=pd.Timestamp(small_df.index[10])
picked_rows = second_dataframe.loc[start:end]
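And if you need the positions themselves (e.g. to feed into the function that only accepts numerical indexes), a small sketch combining get_loc with iloc (assuming both timestamps exist in second_dataframe's index):
start_pos = second_dataframe.index.get_loc(small_df.index[1])
end_pos = second_dataframe.index.get_loc(small_df.index[10])

# iloc slices by position; +1 keeps the end row, matching .loc[start:end]
picked_rows = second_dataframe.iloc[start_pos:end_pos + 1]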
I am sure I am missing a simple solution but I have been unable to figure this out, and have yet to find the answer in the existing questions. (If it is not obvious, I am a hack and just learning Python)
Let's say I have two data frames (DataFileDF, SelectedCellsRaw) with the same two key fields (MRBTS, LNCEL), and I want a subset of the first data frame (DataFileDF) containing only the rows whose key pairs appear in the second data frame,
e.g. rows of DataFileDF with keys that correspond to the keys of SelectedCellsRaw.
Note this needs to match by the key pair MRBTS + LNCEL, not each key individually.
I tried:
SelectedCellsRaw = DataFileDF.loc[DataFileDF['MRBTS'].isin(SelectedCells['MRBTS']) & DataFileDF['LNCEL'].isin(SelectedCells['LNCEL'])]
I get the right MRBTS values, but also every occurrence of LNCEL (it has a possible range of 0-9, so there are many duplicates throughout the data set).
One way you could do this is to use isin with indexes:
joincols = ['MRBTS','LNCEL']
DataFileDF[DataFileDF.set_index(joincols).index.isin(SelectedCellsRaw.set_index(joincols).index)]
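A small made-up example, just to illustrate that this matches on the MRBTS + LNCEL pair rather than on each column separately:
import pandas as pd

DataFileDF = pd.DataFrame({'MRBTS': [1, 1, 2, 2],
                           'LNCEL': [0, 1, 0, 1],
                           'value': [10, 20, 30, 40]})
SelectedCellsRaw = pd.DataFrame({'MRBTS': [1, 2],
                                 'LNCEL': [0, 1]})

joincols = ['MRBTS', 'LNCEL']
mask = DataFileDF.set_index(joincols).index.isin(SelectedCellsRaw.set_index(joincols).index)
print(DataFileDF[mask])  # keeps only (MRBTS=1, LNCEL=0) and (MRBTS=2, LNCEL=1)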
I have a very large dataframe (close to 1 million rows), which has a couple of meta data columns and one single column that contains a long string of triples. One string could look like this:
0,0,123.63;10,360,2736.11;30,270,98.08;...
That is, three values separated by commas, with triples separated by semicolons. Let us refer to the three values as IN, OUT, and MEASURE. Effectively I want to group my data by the original columns plus the IN & OUT columns and then sum over the MEASURE column. Since each long string contains roughly 30 triples, my dataframe would grow to roughly 30 million rows if I simply unstacked the data. Obviously this is not feasible.
So given a set of columns (which may in- or exclude the IN & OUT columns) over which I want to group and then sum my MEASURE data, how would I efficiently strip out the relevant data and sum everything up without blowing up my memory?
My current solution simply loops over each row and then over each triple and keeps a running total of each group I specified. This is very slow, so I am looking for something faster, perhaps vectorised. Any help would be appreciated.
Edit: Sample data below (columns separated by pipe)
DATE|REGION|PRIORITY|PARAMETERS
10-Oct-2016|UK|High|0,0,77.82;30,90,7373.70;
10-Oct-2016|US|Low|0,30,7.82;30,90,733.70;
11-Oct-2016|UK|High|0,0,383.82;40,90,713.75;
12-Oct-2016|NA|Low|40,90,937.11;30,180,98.23;
where PARAMETERS has the form 'IN,OUT,MEASURE;IN,OUT,MEASURE;...'
I basically want to (as an example) create a pivot table where
values=MEASURE
index=DATE, IN
columns=PRIORITY
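For reference, a minimal version of my current loop-based approach on the sample data, reshaped into that pivot (this is the slow baseline I'd like to replace with something vectorised):
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    'DATE': ['10-Oct-2016', '10-Oct-2016', '11-Oct-2016', '12-Oct-2016'],
    'REGION': ['UK', 'US', 'UK', 'NA'],
    'PRIORITY': ['High', 'Low', 'High', 'Low'],
    'PARAMETERS': ['0,0,77.82;30,90,7373.70;',
                   '0,30,7.82;30,90,733.70;',
                   '0,0,383.82;40,90,713.75;',
                   '40,90,937.11;30,180,98.23;'],
})

# Running totals keyed by (DATE, IN, PRIORITY)
totals = defaultdict(float)
for row in df.itertuples(index=False):
    for triple in row.PARAMETERS.split(';'):
        if not triple:
            continue
        in_, out, measure = triple.split(',')
        totals[(row.DATE, int(in_), row.PRIORITY)] += float(measure)

# Reshape the totals into the requested pivot
result = (pd.Series(totals)
            .rename_axis(['DATE', 'IN', 'PRIORITY'])
            .unstack('PRIORITY'))
print(result)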