I have a pandas dataframe df
name e_count e_start e_end
aaaa 3 13,14,15, 18,20,25,
bbbb 2 90,94, 100,102,
The field e_count describes the number of elements in e_start and e_end. I want to make a new dataframe that adds a column e_end-e_start. For example:
name e_count e_start e_end e_end-e_start
aaaa 3 13,14,15, 18,20,25, 5,6,10,
bbbb 2 90,94, 100,102, 10,8,
I tried the following:
df['e_end-e_start'] = ""
new_frame = pd.DataFrame(columns = df.columns)
new_frame['e_end-e_start'] = ""
new_frame_idx = -1
for idx,row in df.iterrows():
    new_frame_idx = new_frame_idx + 1
    new_row = df.ix[idx]
    new_frame = new_frame.append(new_row, ignore_index=True)
    df.ix[idx,'e_end-e_start'] = df.ix[idx,'e_end'] - df.ix[idx,'target_end']
    new_frame.ix[new_frame_idx,'e_end-e_start'] = df.ix[idx,'e_end-e_start']
print new_frame
But I get an error. Can you help?
Generally, you'll get much better performance storing your data as ints instead
of numeric strings separated by commas. A flat, long format such as
In [73]: df
Out[73]:
name e_start e_end
0 aaaa 13 18
0 aaaa 14 20
0 aaaa 15 25
1 bbb 90 100
1 bbb 94 102
makes computation much easier. Here is how you can convert your DataFrame to the
flat format:
import pandas as pd
df = pd.DataFrame({'e_count': [3, 2],
                   'e_end': ['18,20,25,', '100,102,'],
                   'e_start': ['13,14,15,', '90,94,'],
                   'name': ['aaaa', 'bbb']})
dfs = []
for col in ['e_start', 'e_end']:
    # split the comma-separated string into one column per element
    tmp = df[col].str.strip(',').str.split(',').apply(pd.Series)
    # stack the elements into rows, keeping the original row index
    tmp = tmp.stack()
    tmp.index = tmp.index.droplevel(-1)
    tmp.name = col
    tmp = tmp.astype(int)
    dfs.append(tmp)
df = pd.concat([df[['name']]]+dfs, axis=1)
Then, to compute the differences, you could use
df['diff'] = df['e_end'] - df['e_start']
To convert back to the comma separated strings,
In [76]: df.groupby('name').agg(lambda x: ','.join(x.astype(str)))
Out[76]:
e_start e_end diff
name
aaaa 13,14,15 18,20,25 5,6,10
bbb 90,94 100,102 10,8
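On newer pandas (1.3+ supports multi-column DataFrame.explode), the flattening step can be written more directly. A sketch, assuming df is the original comma-string frame from the question:
import pandas as pd

df = pd.DataFrame({'e_count': [3, 2],
                   'e_end': ['18,20,25,', '100,102,'],
                   'e_start': ['13,14,15,', '90,94,'],
                   'name': ['aaaa', 'bbb']})

# turn each comma-separated string into a list
for col in ['e_start', 'e_end']:
    df[col] = df[col].str.strip(',').str.split(',')

# explode both list columns in lockstep, then compute the difference
flat = df[['name', 'e_start', 'e_end']].explode(['e_start', 'e_end'])
flat[['e_start', 'e_end']] = flat[['e_start', 'e_end']].astype(int)
flat['diff'] = flat['e_end'] - flat['e_start']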
I would like to loop over some variable names and the equivalent columns with an added suffix "_plus".
#original dataset
raw_data = {'time': [2,1,4,2],
            'zone': [5,1,3,0],
            'time_plus': [5,6,2,3],
            'zone_plus': [0,9,6,5]}
df = pd.DataFrame(raw_data, columns = ['time','zone','time_plus','zone_plus'])
df
#desired dataset
df['time']=df['time']*df['time_plus']
df['zone']=df['zone']*df['zone_plus']
df
I would like to do the multiplication in a more elegant way, through a loop, since I have many variables with this pattern: original name * transformed variable with the "_plus" suffix.
Something similar to this, or better:
my_list=['time','zone']
for i in my_list:
    df[i]=df[i]*df[i+"_plus"]
Try:
for c in df.filter(regex=r".*(?<!_plus)$", axis=1):
    df[c] *= df[c + "_plus"]
print(df)
Prints:
time zone time_plus zone_plus
0 10 0 5 0
1 6 9 6 9
2 8 18 2 6
3 6 0 3 5
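The regex relies on a negative lookbehind, (?<!_plus)$, so only column names that do not end in "_plus" pass the filter; a quick check (result shown as a comment):
print(list(df.filter(regex=r".*(?<!_plus)$", axis=1)))  # ['time', 'zone']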
Or:
for c in df.columns:
    if not c.endswith("_plus"):
        df[c] *= df[c + "_plus"]
raw_data = {'time': [2,1,4,2],
            'zone': [5,1,3,0],
            'time_plus': [5,6,2,3],
            'zone_plus': [0,9,6,5]}
df = pd.DataFrame(raw_data, columns = ['time','zone','time_plus','zone_plus'])
# Take every column that doesn't have a "_plus" suffix
cols = [i for i in list(df.columns) if "_plus" not in i]
# Calculate new columns
for col in cols:
    df[col + "_2"] = df[col] * df[col + "_plus"]
I decided to create the new columns with a "_2" suffix; this way we don't mess up the original data.
for c in df.columns:
    if f"{c}_plus" in df.columns:
        df[c] *= df[f"{c}_plus"]
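If there are many such column pairs, the loop can also be replaced by a single array multiplication. A sketch, assuming every base column has a matching "_plus" partner:
# multiply all base columns by their "_plus" partners at once
base = [c for c in df.columns if not c.endswith("_plus")]
df[base] = df[base].to_numpy() * df[[c + "_plus" for c in base]].to_numpy()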
I want to remove everything after the last underscore from each value in the dataframe. If my data in the dataframe looks like:
AA_XX,
AAA_BB_XX,
AA_BB_XYX,
AA_A_B_YXX
I would like to get this result:
AA,
AAA_BB,
AA_BB,
AA_A_B
You can do this simply using Series.str.split and Series.str.join:
In [2381]: df
Out[2381]:
col1
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
In [2386]: df['col1'] = df['col1'].str.split('_').str[:-1].str.join('_')
In [2387]: df
Out[2387]:
col1
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
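A variant of the same idea is Series.str.rsplit with n=1, which splits only at the last underscore and leaves values without an underscore untouched. A sketch on the original column:
# split once, from the right, and keep the part before the last underscore
df['col1'] = df['col1'].str.rsplit('_', n=1).str[0]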
pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
Explanation:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})
Creates
col
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
Use apply in order to loop through the column you want to edit.
I split the string at each _ and then joined the parts back together with _, leaving out the last part.
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)
Results:
col
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
If your dataset contains values like AA (values without an underscore), change the lambda like this:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
Here is another way of going about it.
import pandas as pd
data = {'s': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']}
df = pd.DataFrame(data)
def cond1(s):
    parts = s.split('_')
    if len(parts) == 1:
        # no underscore, keep the value unchanged
        return s
    return '_'.join(parts[:-1])
df['result'] = df['s'].apply(cond1)
Let's say that I have a simple Dataframe.
import pandas as pd
data1 = [12,34,'fsdf',678,'','','dfs','','']
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4
5
6 dfs
7
8
I want to delete all the data except the last non-empty value found in the column, which I want to keep in the first row. The column can have thousands of rows. So I would like this result:
Data
0 dfs
1
2
3
4
5
6
7
8
And I have to keep the shape of this dataframe, so rows must not be removed.
What are the simplest functions to do that efficiently?
Thank you
Get the index of the last non-empty string value and assign that value to the first row of the column:
s = df1.loc[df1['Data'].iloc[::-1].ne('').idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
If empty strings are missing values:
import numpy as np

data1 = [12,34,'fsdf',678,np.nan,np.nan,'dfs',np.nan,np.nan]
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 fsdf
3 678
4 NaN
5 NaN
6 dfs
7 NaN
8 NaN
s = df1.loc[df1['Data'].iloc[::-1].notna().idxmax(), 'Data']
print (s)
dfs
df1['Data'] = ''
df1.loc[0, 'Data'] = s
print (df1)
Data
0 dfs
1
2
3
4
5
6
7
8
A simple pandas condition check like this can help:
df1['Data'] = [df1.loc[df1['Data'].ne(""), "Data"].iloc[-1]] + [''] * (len(df1) - 1)
You can replace '' with NaN using df.replace, then use df.last_valid_index:
val = df1.loc[df1.replace('', np.nan).last_valid_index(), 'Data']
# Below two lines taken from #jezrael's answer
df1.loc[0, 'Data'] = val
df1.loc[1:, 'Data'] = ''
Or
You can use np.full with fill_value set to np.nan here.
val = df1.loc[df1.replace("", np.nan).last_valid_index(), "Data"]
df1 = pd.DataFrame(np.full(df1.shape, np.nan),
                   index=df1.index,
                   columns=df1.columns)
df1.loc[0, "Data"] = val
I'm trying to change the structure of my data from a text file (.txt); the data looks like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
And I would like to transform it into this format (like a pivot table in Excel, where the column name is the character between ":" and each group always starts with :1:):
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Does anyone have any idea? Thanks in advance.
First create the DataFrame with read_csv and header=None, because there is no header in the file:
import pandas as pd
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
from io import StringIO
df = pd.read_csv(StringIO(temp), header=None)
print (df)
0
0 :1:A
1 :2:B
2 :3:C
3 :1:D
4 :2:E
5 :3:F
6 :4:G
7 :1:H
8 :3:I
9 :4:J
Extract the original column with DataFrame.pop, remove the enclosing : with Series.str.strip and Series.str.split the values into 2 new columns. Then create group numbers by comparing against the string '1' with Series.eq and accumulating with Series.cumsum, create a MultiIndex with DataFrame.set_index, and finally reshape with Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a 1 2 3 4
a
1 A B C
2 D E F G
3 H I J
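The same reshape can also be written with DataFrame.pivot instead of set_index + unstack; a sketch using the a/b columns created above:
# group number = running count of ':1:' markers, then pivot to wide
df1 = (df.assign(g=df['a'].eq('1').cumsum())
         .pivot(index='g', columns='a', values='b')
         .fillna(''))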
Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index.name = 'Group'
res.index = range(1, res.shape[0] + 1)
res
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Another way to do this:
#read the file
with open("t.txt") as f:
    content = f.readlines()
#Create a dictionary from the lines, keeping the column names (e.g. ':1:') as keys and the row values (e.g. 'A') in lists
my_dict={}
for v in content:
    v = v.strip()
    key = v[0:3]   # take the column name, e.g. ':1:'
    value = v[3:]  # take the value, e.g. 'A'
    my_dict.setdefault(key,[]).append(value)
#convert the dictionary to a dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict,orient='index').transpose()
df
The output will look like this:
:1: :2: :3: :4:
0 A B C G
1 D E F J
2 H None I None
I have a DataFrame that looks like this:
import numpy as np
raw_data = {'Series_Date':['2017-03-10','2017-03-13','2017-03-14','2017-03-15'],'SP':[35.6,56.7,41,41],'1M':[-7.8,56,56,-3.4],'3M':[24,-31,53,5]}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['Series_Date','SP','1M','3M'])
print df
I would like to run a test on certain columns in this DataFrame only; the column names are in this set:
check = {'1M','SP'}
print check
For these columns, I would like to know when the value in either of them is the same as the value on the previous day. The output dataframe should return the series date and a comment, such as (for this example):
output_data = {'Series_Date':['2017-03-14','2017-03-15'],'Comment':["Value for 1M data is same as previous day","Value for SP data is same as previous day"]}
output_data_df = pd.DataFrame(output_data,columns = ['Series_Date','Comment'])
print output_data_df
Could you please provide some assistance how to deal with this?
The following does more or less what you want.
Columns named item_ok are added to the original dataframe, specifying whether the value is the same as on the previous day:
from datetime import timedelta
df['Date_diff'] = pd.to_datetime(df['Series_Date']).diff()
for item in check:
    df[item+'_ok'] = (df[item].diff() == 0) & (df['Date_diff'] == timedelta(1))
df_output = df.loc[(df[[item + '_ok' for item in check]]).any(axis=1)]
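To get the Comment text the question asks for, the same *_ok flags can be turned into messages. A sketch (if both columns repeat on the same day, the last message wins):
df['Comment'] = ''
for item in check:
    df.loc[df[item + '_ok'], 'Comment'] = 'Value for %s data is same as previous day' % item
output_data_df = df.loc[df['Comment'].ne(''), ['Series_Date', 'Comment']].reset_index(drop=True)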
I'm not sure it is the cleanest way to do it; however, it works.
check = {'1M', 'SP'}
prev_dict = {c: None for c in check}
def check_prev_value(row):
    global prev_dict
    msg = ""
    # MAYBE add clause to check if both are equal
    for column in check:
        if row[column] == prev_dict[column]:
            msg = 'Value for %s data is same as previous day' % column
        prev_dict[column] = row[column]
    return msg
df['comment'] = df.apply(check_prev_value, axis=1)
output_data_df = df[df['comment'] != ""]
output_data_df = output_data_df[["Series_Date", "comment"]].reset_index(drop=True)
For your input:
Series_Date SP 1M 3M
0 2017-03-10 35.6 -7.8 24
1 2017-03-13 56.7 56.0 -31
2 2017-03-14 41.0 56.0 53
3 2017-03-15 41.0 -3.4 5
The output is:
Series_Date comment
0 2017-03-14 Value for 1M data is same as previous day
1 2017-03-15 Value for SP data is same as previous day
Reference: this answer
cols = ['1M','SP']
for col in cols:
    df[col + '_dup'] = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount()
Output column will have an integer greater than zero when a duplicate is found.
df:
Series_Date SP 1M 3M 1M_dup SP_dup
0 2017-03-10 35.6 -7.8 24 0 0
1 2017-03-13 56.7 56.0 -31 0 0
2 2017-03-14 41.0 56.0 53 1 0
3 2017-03-15 41.0 -3.4 5 0 1
Slice to find dups:
col = 'SP'
dup_df = df[df[col + '_dup'] > 0][['Series_Date', col + '_dup']]
dup_df:
Series_Date SP_dup
3 2017-03-15 1
Here is a function version of the above (with the added feature of handling multiple columns):
import pandas as pd
import numpy as np
def find_repeats(df, col_list, date_col='Series_Date'):
    dummy_df = df[[date_col, *col_list]].copy()
    dates = dummy_df[date_col]
    date_series = []
    code_series = []
    if len(col_list) > 1:
        for col in col_list:
            these_repeats = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount().values
            repeat_idx = list(np.where(these_repeats > 0)[0])
            date_arr = dates.iloc[repeat_idx]
            code_arr = [col] * len(date_arr)
            date_series.extend(list(date_arr))
            code_series.extend(code_arr)
        return pd.DataFrame({date_col: date_series, 'col_dup': code_series}).sort_values(date_col).reset_index(drop=True)
    else:
        col = col_list[0]
        dummy_df[col + '_dup'] = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount()
        return dummy_df[dummy_df[col + '_dup'] > 0].reset_index(drop=True)
find_repeats(df, ['1M'])
Series_Date 1M 1M_dup
0 2017-03-14 56.0 1
find_repeats(df, ['1M', 'SP'])
Series_Date col_dup
0 2017-03-14 1M
1 2017-03-15 SP
And here is another way using pandas diff:
def find_repeats(df, col_list, date_col='Series_Date'):
    code_list = []
    dates = list()
    for col in col_list:
        these_dates = df[date_col].iloc[np.where(df[col].diff().values == 0)[0]].values
        code_arr = [col] * len(these_dates)
        dates.extend(list(these_dates))
        code_list.extend(code_arr)
    return pd.DataFrame({date_col: dates, 'val_repeat': code_list}).sort_values(date_col).reset_index(drop=True)
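For completeness, a vectorized sketch with shift and melt that produces the same long output as find_repeats, comparing each value to the previous row:
check = ['1M', 'SP']
# True where a value equals the value in the previous row
same = df[check].eq(df[check].shift())
out = (same.assign(Series_Date=df['Series_Date'])
           .melt(id_vars='Series_Date', var_name='col_dup', value_name='is_dup')
           .query('is_dup')[['Series_Date', 'col_dup']]
           .sort_values('Series_Date')
           .reset_index(drop=True))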