I have a pandas dataframe with a pipe delimited column with an arbitrary number of elements, called Parts. The number of elements in these pipe-strings varies from 0 to over 10. The number of unique elements contained in all pipe-strings is not much smaller than the number of rows (which makes it impossible for me to manually specify all of them while creating new columns).
For each row, I want to create a new column acting as an indicator variable for each element of the pipe-delimited list. For instance, the row
...'Parts'...
...'12|34|56'
should be transformed to
...'Part_12' 'Part_34' 'Part_56'...
...1 1 1...
Because there are a lot of unique parts, these columns are obviously going to be sparse - mostly zeros, since each row only contains a small fraction of the unique parts.
I haven't found any approach that doesn't require manually specifying the columns (for instance, Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries).
I've also looked at pandas' melt, but I don't think that's the appropriate tool.
The way I know how to solve it would be to pipe the raw CSV to another python script and deal with it on a char-by-char basis, but I need to work within my existing script since I will be processing hundreds of CSVs in this manner.
Here's a better illustration of the data:

ID    YEAR  AMT      PARTZ
1202  2007  99.34
9321  1988  1012.99  2031|8942
2342  2012  381.22   1939|8321|Amx3
You can use get_dummies and add_prefix:
df.Parts.str.get_dummies().add_prefix('Part_')
Output:
Part_12 Part_34 Part_56
0 1 1 1
Edit, per a comment: counting duplicates as well.
df = pd.DataFrame({'Parts':['12|34|56|12']}, index=[0])
pd.get_dummies(df.Parts.str.split('|', expand=True).stack()).groupby(level=0).sum().add_prefix('Part_')
Output:
Part_12 Part_34 Part_56
0 2 1 1
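For completeness, a minimal self-contained run on the question's own sample data (assuming the pipe column is named Parts; the empty cell becomes NaN and, as str.get_dummies produces all zeros for missing values, it needs no special handling):

```python
import pandas as pd

# Baby version of the question's data; the empty parts cell is NaN.
df = pd.DataFrame({'ID': [1202, 9321, 2342],
                   'Parts': [None, '2031|8942', '1939|8321|Amx3']})

# str.get_dummies splits on '|' by default; a NaN row yields all zeros.
dummies = df['Parts'].str.get_dummies().add_prefix('Part_')
out = df.join(dummies)
```

No column names need to be specified by hand: the unique parts are discovered from the data itself.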
Related
I have two dataframes. One is very large, with over 4 million rows of data, while the other has about 26k. I'm trying to create a dictionary whose keys are the strings of the smaller dataframe. That dataframe (df1) contains substrings or incomplete names, and the larger dataframe (df2) contains full names/strings; I want to check if the substring from df1 is in the strings in df2 and then build my dict.
No matter what I try, my code takes too long, and I keep looking for faster ways to iterate through the df's.
org_dict = {}
for rowi in df1.itertuples():
    part = rowi.part_name
    full_list = []
    for rowj in df2.itertuples():
        if part in rowj.full_name:
            full_list.append(rowj.full_name)
    org_dict[part] = full_list
Am I missing a break or is there a faster way to iterate through really large dataframes of way over 1 million rows?
Sample data:
df1
part_name
0 aaa
1 bb
2 856
3 cool
4 man
5 a0
df2
full_name
0 aaa35688d
1 coolbbd
2 8564578
3 coolaaa
4 man4857684
5 a03567
expected output:
{'aaa':['aaa35688d','coolaaa'],
'bb':['coolbbd'],
'856':['8564578']
...}
etc
The issue here is that nested for loops perform very badly time-wise as the data grows larger. Luckily, pandas allows us to perform vectorised operations across rows/columns.
I can't properly test without having access to a sample of your data, but I believe this does the trick and performs much faster:
org_dict = {substr: df2.full_name[df2.full_name.str.contains(substr, regex=False)].tolist() for substr in df1.part_name}
(regex=False makes each substring match literally, guarding against regex metacharacters in part_name.)
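A quick check of that comprehension on the question's sample data (regex=False is added here so substrings match literally rather than as regular expressions):

```python
import pandas as pd

df1 = pd.DataFrame({'part_name': ['aaa', 'bb', '856', 'cool', 'man', 'a0']})
df2 = pd.DataFrame({'full_name': ['aaa35688d', 'coolbbd', '8564578',
                                  'coolaaa', 'man4857684', 'a03567']})

# One vectorised str.contains scan over df2 per substring in df1.
org_dict = {substr: df2.full_name[df2.full_name.str.contains(substr, regex=False)].tolist()
            for substr in df1.part_name}
```

This replaces the inner Python-level loop over 4 million rows with a single vectorised scan per substring, which is where the speedup comes from.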
I have 2 pandas dataframes, df_pe and df_merged. Both the dataframes have several rows, as well as several columns. Now, there are some specific things I would like to accomplish using these dataframes:
In df_merged, there is a column named ST, which contains timestamps of various events in a format like 2017-08-27 00:00:00. In df_pe, there are 2 columns, TOn and TOff, which contain the times when an event started and when an event ended (e.g. a TOn value for a random row: 2018-08-17 01:20:00, with a TOff value of 2018-08-17 02:30:00).
Secondly, there is a column in df_pe, namely EC. I have another dataframe called df_uniqueal, which also has a column called EC. What I would like to do is:
a. For all rows in df_merged, whenever the ST value falls within the duration between TOn and TOff in df_pe, create 2 new columns in df_merged: EC and ED. Put the EC value from df_pe into the new EC column, and put the corresponding value from df_uniqueal into the new ED column (ED is eventually a mapped version of df_pe's EC, looked up in df_uniqueal). If none of the conditions matches, or NaNs (missing values) are left after this procedure, put the string "NF" into df_merged's new ED column and the integer 0 into its new EC column.
I have explored SO and SE, but have not found anything substantial. Any help in this regard is highly appreciated.
This is my attempt at using for loops in Python to iterate over the dataframes and accomplish the first condition, but it runs forever, and I don't think it is the best possible way to accomplish this.
for i in range(len(df_merged)):
    for j in range(len(df_pe)):
        if df_pe.TOn[j] < df_merged.ST[i] < df_pe.TOff[j]:
            df_merged.EC[i] = df_pe.EC[j]
            df_merged.ED[i] = df_uniqueal.ED[df_pe.EC[j]]
        else:
            df_merged.EC[i] = 0
            df_merged.ED[i] = "NF"
EDIT
Please refer to the image for the expected output and a baby example of the dataframes.
The relevant columns are in bold (note the column numbers may differ, but the column names are the same in this sample example).
If I have understood the question correctly hopefully this will get you started.
for i, val in df_merged['ST'].items():
    bool_idx = (df_pe['TOn'] < val) & (val < df_pe['TOff'])
    if df_pe[bool_idx]['EC'].empty:
        df_merged.loc[i, 'EC'] = 0
        df_merged.loc[i, 'ED'] = "NF"
    else:
        # Take the first matching event's EC as a scalar.
        value_from_df_pe = df_pe[bool_idx]['EC'].iloc[0]
        df_merged.loc[i, 'EC'] = value_from_df_pe
        # Map that EC to its ED via df_uniqueal.
        value_from_df_uniqueal = df_uniqueal[df_uniqueal['EC'] == value_from_df_pe]['ED'].iloc[0]
        df_merged.loc[i, 'ED'] = value_from_df_uniqueal
Please note I have not tested this code on any data.
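If the (TOn, TOff) windows never overlap, the per-row loop can be avoided entirely with an IntervalIndex lookup. A minimal sketch (untested on the real data; the baby frames below, including the EC value 7 and the ED string 'pump', are made up for illustration):

```python
import pandas as pd

# Hypothetical baby data mirroring the question (column names assumed).
df_pe = pd.DataFrame({
    'TOn':  pd.to_datetime(['2018-08-17 01:20:00']),
    'TOff': pd.to_datetime(['2018-08-17 02:30:00']),
    'EC':   [7],
})
df_uniqueal = pd.DataFrame({'EC': [7], 'ED': ['pump']})
df_merged = pd.DataFrame({'ST': pd.to_datetime(
    ['2018-08-17 02:00:00', '2018-08-18 05:00:00'])})

# Build open (TOn, TOff) windows (matching the strict < comparisons)
# and locate every ST in one call; -1 means "no window matched".
iv = pd.IntervalIndex.from_arrays(df_pe['TOn'], df_pe['TOff'], closed='neither')
pos = iv.get_indexer(df_merged['ST'])
df_merged['EC'] = [df_pe['EC'].iat[p] if p >= 0 else 0 for p in pos]
# Map EC -> ED through df_uniqueal; unmatched rows fall back to "NF".
ec_to_ed = df_uniqueal.set_index('EC')['ED']
df_merged['ED'] = df_merged['EC'].map(ec_to_ed).fillna('NF')
```

get_indexer requires non-overlapping intervals; if events can overlap, the boolean-mask loop above is the safer route.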
I have two large tables, one of them relatively small at ~8 million rows and one column, the other large at 173 million rows and one column. The index of the first dataframe is an IntervalIndex (e.g. (0, 13], (13, 20], (20, 23], ...) and the second one is ordered numbers (1, 2, 3, ...). Both DataFrames are sorted, so:
DF1        category
(0, 13]    1
(13, 20]   2
....

Df2   Value
1     5.2
2     3.4
3     7.8

Desired Df3:

index  value  category
1      5.2    1
2      3.4    1
3      7.8    1
I want to obtain an inner join (with a fast algorithm) on data_frame2.index, similar to an inner join in MySQL.
I would like to be able to perform it in a cluster, because when I produced the inner join with a smaller second dataset the result was extremely memory-consuming: imagine 105 MB for 10 rows using map_partitions.
Another problem is that I cannot use scatter twice: given that first DaskDF = client.scatter(dataframe2) is followed by DaskDF = client.submit(fun1, DaskDF), I am unable to do something like client.submit(fun2, DaskDF).
You might try using smaller partitions. Recall that the memory use of a join depends on how many shared rows there are. Depending on your data, the memory use of an output partition may be much larger than the memory use of your input partitions.
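On the pandas side, the interval lookup itself can be done in one vectorised call per partition; a minimal sketch on the toy frames from the question (the inner join falls out of dropping the -1 positions that get_indexer returns for unmatched rows):

```python
import pandas as pd

# Toy versions of the two tables from the question.
df1 = pd.DataFrame({'category': [1, 2]},
                   index=pd.IntervalIndex.from_tuples([(0, 13), (13, 20)]))
df2 = pd.DataFrame({'value': [5.2, 3.4, 7.8]}, index=[1, 2, 3])

# Locate each df2 index inside df1's IntervalIndex; -1 means "no
# interval contains it", so keeping pos >= 0 is the inner join.
pos = df1.index.get_indexer(df2.index)
df3 = df2[pos >= 0].assign(category=df1['category'].to_numpy()[pos[pos >= 0]])
```

Applied per partition (e.g. via map_partitions with df1 broadcast), this keeps each output no larger than the matching input rows.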
I am working with a huge volume of data, around 50 million rows.
I want to find the unique combinations of values from multiple columns. I use the script below:
dataAll[['Frequency', 'Period', 'Date']].drop_duplicates()
But this is taking a long time, more than 40 minutes.
I found an alternative:
pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))
but the script above gives an array, whereas I need a DataFrame, like the first script produces.
Generally, the output of your new code is impossible to convert back to a DataFrame, because:
pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))
creates one big 1d numpy array, so after removing duplicates it is impossible to recreate the rows.
E.g. if there are 2 unique values, 3 and 1, it is impossible to find which datetimes belong to 3 and which to 1.
But if there is only one unique value for Frequency, and for each Period it is possible to find the Date as in the sample, a solution is possible.
EDIT:
One possible alternative is to use dask.dataframe.DataFrame.drop_duplicates.
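The idea behind dask's drop_duplicates, deduplicating each partition and then deduplicating the much smaller set of survivors, can be sketched in plain pandas (chunk_size is an arbitrary knob here, and chunked_drop_duplicates is a hypothetical helper, not a library function):

```python
import pandas as pd

def chunked_drop_duplicates(df, cols, chunk_size=1_000_000):
    # Deduplicate each chunk separately, then deduplicate the
    # concatenation of the (much smaller) per-chunk results.
    pieces = [df[cols].iloc[i:i + chunk_size].drop_duplicates()
              for i in range(0, len(df), chunk_size)]
    return pd.concat(pieces).drop_duplicates().reset_index(drop=True)

# e.g. chunked_drop_duplicates(dataAll, ['Frequency', 'Period', 'Date'])
```

With dask the same happens in parallel across partitions, which is where the wall-clock win over a single 50-million-row drop_duplicates comes from.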
I have some data in a pandas dataframe which looks like this:
gene VIM
time:2|treatment:TGFb|dose:0.1 -0.158406
time:2|treatment:TGFb|dose:1 0.039158
time:2|treatment:TGFb|dose:10 -0.052608
time:24|treatment:TGFb|dose:0.1 0.157153
time:24|treatment:TGFb|dose:1 0.206030
time:24|treatment:TGFb|dose:10 0.132580
time:48|treatment:TGFb|dose:0.1 -0.144209
time:48|treatment:TGFb|dose:1 -0.093910
time:48|treatment:TGFb|dose:10 -0.166819
time:6|treatment:TGFb|dose:0.1 0.097548
time:6|treatment:TGFb|dose:1 0.026664
time:6|treatment:TGFb|dose:10 -0.008032
where the left is an index. This is just a subsection of the data which is actually much larger. The index is composed of three components, time, treatment and dose. I want to reorganize this data such that I can access it easily by slicing. The way to do this is to use pandas MultiIndexing but I don't know how to convert my DataFrame with one index into another with three. Does anybody know how to do this?
To clarify, the desired output here is the same data with a three-level index, the outer being treatment, the middle dose and the inner time. This would be useful, as I could then access the data with something like df['time']['dose'] or df[0] (or something to that effect at least).
You can first replace the unnecessary strings (the index has to be converted to a Series by to_series, because replace doesn't work with an index yet) and then use split. Last, set the index names by rename_axis (new in pandas 0.18.0):
df.index = df.index.to_series().replace({'time:':'','treatment:': '','dose:':''}, regex=True)
df.index = df.index.str.split('|', expand=True)
df = df.rename_axis(('time','treatment','dose'))
print (df)
VIM
time treatment dose
2 TGFb 0.1 -0.158406
1 0.039158
10 -0.052608
24 TGFb 0.1 0.157153
1 0.206030
10 0.132580
48 TGFb 0.1 -0.144209
1 -0.093910
10 -0.166819
6 TGFb 0.1 0.097548
1 0.026664
10 -0.008032
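Once the MultiIndex is in place, slicing works with loc and xs. A small sketch rebuilding part of the frame above (note that after str.split the levels are strings, so select with '2', not 2):

```python
import pandas as pd

# Rebuild a tiny piece of the MultiIndexed frame from the answer.
idx = pd.MultiIndex.from_tuples(
    [('2', 'TGFb', '0.1'), ('2', 'TGFb', '1'), ('24', 'TGFb', '0.1')],
    names=['time', 'treatment', 'dose'])
df = pd.DataFrame({'VIM': [-0.158406, 0.039158, 0.157153]}, index=idx)

# Select all rows for time '2' (drops the outer level):
sub = df.loc['2']
# Cross-section on an inner level:
dose01 = df.xs('0.1', level='dose')
```

If integer levels are preferred, the time level could be converted with df.index.set_levels afterwards.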