I have a dataframe with a large MultiIndex, sourced from a vast number of csv files. Some of those files have errors in the various labels, i.e. "window" is misspelled as "winZZw", which then causes problems when I select all windows with df.xs('window', level='middle', axis=1).
So I need a way to simply replace winZZw with window.
Here's a very minimal sample df (let's assume the data and the 'roof', 'window'… strings come from some convoluted text reader):
import pandas as pd
import numpy as np

header = pd.MultiIndex.from_product([['roof'], ['window'], ['basement']], names=['top', 'middle', 'bottom'])
dates = pd.date_range('01/01/2000','01/12/2010', freq='MS')
data = np.random.randn(len(dates))
df = pd.DataFrame(data, index=dates, columns=header)
header2 = pd.MultiIndex.from_product([['roof'], ['winZZw'], ['basement']], names=['top', 'middle', 'bottom'])
data = 3*(np.random.randn(len(dates)))
df2 = pd.DataFrame(data, index=dates, columns=header2)
df = pd.concat([df, df2], axis=1)
header3 = pd.MultiIndex.from_product([['roof'], ['door'], ['basement']], names=['top', 'middle', 'bottom'])
data = 2*(np.random.randn(len(dates)))
df3 = pd.DataFrame(data, index=dates, columns=header3)
df = pd.concat([df, df3], axis=1)
Now I want to xs a new dataframe for all the houses that have a window at their middle level: windf = df.xs('window', level='middle', axis=1)
But this obviously misses the misspelled winZZw.
So, how do I replace winZZw with window?
The only way I found was to use set_levels, but if I understood that correctly, I need to feed it the whole level, i.e.
df.columns.set_levels([u'window', u'window', u'door'], level='middle', inplace=True)
but this has two issues:
I need to pass it the whole index, which is easy in this sample, but impossible/stupid for a thousand-column df with hundreds of labels.
It seems to need the list backwards (now the first entry in my df has door in the middle level, instead of the window it had). That can probably be fixed, but it seems weird.
I can work around these issues by xsing a new df of only the winZZws, setting the levels with set_levels(df.shape[1]*[u'window'], level='middle'), and then concatting it all back together, but I'd like something more straightforward, analogous to str.replace('winZZw', 'window'), and I can't figure out how.
Use rename, specifying the level:
header = pd.MultiIndex.from_product([['roof'],[ 'window'], ['basement']], names = ['top', 'middle', 'bottom'])
dates = pd.date_range('01/01/2000','01/12/2010', freq='MS')
data = np.random.randn(len(dates))
df = pd.DataFrame(data, index=dates, columns=header)
header2 = pd.MultiIndex.from_product([['roof'], ['winZZw'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 3*(np.random.randn(len(dates)))
df2 = pd.DataFrame(data, index=dates, columns=header2)
df = pd.concat([df, df2], axis=1)
header3 = pd.MultiIndex.from_product([['roof'], ['door'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 2*(np.random.randn(len(dates)))
df3 = pd.DataFrame(data, index=dates, columns=header3)
df = pd.concat([df, df3], axis=1)
# rename only touches matching labels in the 'middle' level; everything else is left alone
df = df.rename(columns={'winZZw': 'window'}, level='middle')
print(df.head())
top              roof
middle         window                door
bottom       basement  basement  basement
2000-01-01  -0.131052 -1.189049  1.310137
2000-02-01  -0.200646  1.893930  2.124765
2000-03-01  -1.690123 -2.128965  1.639439
2000-04-01  -0.794418  0.605021 -2.810978
2000-05-01   1.528002 -0.286614  0.736445
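With the level renamed, the original selection now picks up the previously misspelled columns as well; a quick check (a sketch, continuing with the df built above):
# both the former 'winZZw' column and the original 'window' column are selected together
windf = df.xs('window', level='middle', axis=1)
print(windf.shape[1])   # 2 columns, one from each of the concatenated frames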
I need to edit data in a spreadsheet as below:
Replace: if the date already exists in the spreadsheet;
Append: if the date doesn't exist in the spreadsheet.
Sample data attached below.
Kindly help.
Use concat with DataFrame.drop_duplicates and DataFrame.sort_values:
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
df = (pd.concat([df1, df2])
.drop_duplicates('Date', keep='last')
.sort_values('Date', ignore_index=True))
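A minimal sketch of the behaviour with made-up frames (df1 as the existing sheet, df2 as the new rows; the names and values here are only for illustration):
import pandas as pd

df1 = pd.DataFrame({'Date': ['01-06-2021', '02-06-2021'], 'Value': [10, 20]})
df2 = pd.DataFrame({'Date': ['02-06-2021', '03-06-2021'], 'Value': [99, 30]})
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
df = (pd.concat([df1, df2])
        .drop_duplicates('Date', keep='last')
        .sort_values('Date', ignore_index=True))
# 01-06 keeps 10 (only in df1), 02-06 is replaced by 99, 03-06 with 30 is appended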
Use:
df = pd.DataFrame({'date':pd.date_range('2021-6-1', '2021-6-15'), 'price': range(15)})
new_df = pd.DataFrame({'date':pd.date_range('2021-6-11', '2021-6-17'), 'price': range(15,22)})
df.merge(new_df, left_on='date', right_on='date', how='outer').apply(
    lambda x: x['price_y'] if not np.isnan(x['price_y']) else x['price_x'],
    axis=1
)
Result:
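Note that the apply above returns just the combined prices as a Series; if the dates should stay alongside, one option (a sketch, not the answer's exact code, using fillna instead of the row-wise apply) is to assign the coalesced prices back into the merged frame:
merged = df.merge(new_df, on='date', how='outer')
merged['price'] = merged['price_y'].fillna(merged['price_x'])
result = merged[['date', 'price']]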
When I use an aggregate function, the resulting columns under 'price' and 'carat' both get the same column name, 'mean'.
How do I rename the 'mean' under price to price_mean and the one under carat to carat_mean?
I can't change them individually.
diamonds.groupby('cut').agg({
    'price': ['count', 'mean'],
    'carat': 'mean'
}).rename(columns={'mean': 'price_mean', 'mean': 'carat_mean'}, level=1)
You could try this:
# Rename the level-1 columns of each level-0 block
df1 = df["price"]
df1.columns = ["count", "price_mean"]
df2 = df["carat"]
df2.columns = ["carat_mean"]
# Reassemble the renamed pieces under their level-0 keys
df = pd.concat([df1, df2], axis=1, keys=['price', 'carat'])
print(df)
# Output
          price                  carat
          count  price_mean  carat_mean
Fair   0.693995   -0.632283    0.789963
Good   0.099057    1.005623    0.143289
Ideal -0.277984   -0.105138   -0.611168
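If a flat column index is acceptable, another route (a sketch, assuming pandas >= 0.25 and the same diamonds data) is named aggregation, which lets each output column be named directly:
# each keyword becomes an output column: (source column, aggregation)
out = diamonds.groupby('cut').agg(
    price_count=('price', 'count'),
    price_mean=('price', 'mean'),
    carat_mean=('carat', 'mean'),
)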
As the title suggests, I am trying to turn this
into this
The end goal is ideally to be able to group them up and get a word count.
This is the code that I've tried so far. I am unsure about it.
df_unpivoted = final_df.melt(df.reset_index(), id_vars= 'index', var_name = 'Count', value_name = 'Value')
df = final_df.rename(columns=lambda x: x + x[-1] if x.startswith('index') else x)
df = pd.wide_to_long(df, ['0'], i='id', j='i')
df = df.reset_index(level=1, drop=True).reset_index()
Please review the code below. Is there a more efficient way of splitting one DF into two? In the code below, the query is run twice. Would it be faster to just run the query once and basically say: if true, send to DF1, else to DF2? Or maybe, after DF1 is created, some way to say that DF2 = DF minus DF1?
code:
x1='john'
df = pd.read_csv(file, sep='\n', header=None, engine='python', quoting=3)
df = df[0].str.strip(' \t"').str.split('[,|;: \t]+', 1, expand=True).rename(columns={0: 'email', 1: 'data'})
df1= df[df.email.str.startswith(x1)]
df2= df[~df.email.str.startswith(x1)]
There's no need to compute the mask df.email.str.startswith(x1) twice.
mask = df.email.str.startswith(x1)
df1 = df[mask].copy()   # .copy() so later edits don't raise SettingWithCopyWarning
df2 = df[~mask].copy()  # https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
I'm trying to analyze a network traffic dataset with over 1,000,000 packets, and I have the following code:
pcap_data = pd.read_csv('/home/alexfrancow/AAA/data1.csv')
pcap_data.columns = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'len']
pcap_data['info'] = "null"
pcap_data.parse_dates=["time"]
pcap_data['num'] = 1
df = pcap_data
df
%%time
df['time'] = pd.to_datetime(df['time'])
df.index = df['time']
data = df.copy()
data_group = pd.DataFrame({'count': data.groupby(['ipdst', 'proto', data.index]).size()}).reset_index()
pd.options.display.float_format = '{:,.0f}'.format
data_group.index = data_group['time']
data_group
data_group2 = data_group.groupby(['ipdst','proto']).resample('5S', on='time').sum().reset_index().dropna()
data_group2
The first part of the script, importing the .csv, runs in 5 seconds, but the groupby on IP + PROTO with the 5-second resample takes 15 minutes. Does anyone know how I can get better performance?
EDIT:
Now I'm trying to use dask, and I have the following code:
Import the .csv
filename = '/home/alexfrancow/AAA/data1.csv'
df = dd.read_csv(filename)
df.columns = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'info']
df.parse_dates=["time"]
df['num'] = 1
%time df.head(2)
Group by ipdst + proto by 5S freq
df.set_index('time').groupby(['ipdst','proto']).resample('5S', on='time').sum().reset_index()
How can I group by IP + PROTO by 5S frequency?
I tried to simplify your code a bit, but for a large DataFrame the performance should be only slightly better:
pd.options.display.float_format = '{:,.0f}'.format
#convert time column to DatetimeIndex
pcap_data = pd.read_csv('/home/alexfrancow/AAA/data1.csv',
                        parse_dates=['time'],
                        index_col=['time'])
# 'time' is now the DatetimeIndex, so only the remaining columns get renamed
pcap_data.columns = ['no', 'ipsrc', 'ipdst', 'proto', 'len']
pcap_data['info'] = "null"
pcap_data['num'] = 1
# size() with reset_index(name='count') avoids the DataFrame constructor
data_group = pcap_data.groupby(['ipdst', 'proto', 'time']).size().reset_index(name='count')
data_group2 = (data_group.set_index('time')
                         .groupby(['ipdst', 'proto'])
                         .resample('5S')
                         .sum()
                         .reset_index()
                         .dropna())
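If the two-step groupby-then-resample turns out to be the bottleneck, a possible alternative (a sketch, not tested on your data) is to bin the DatetimeIndex in the same groupby pass with pd.Grouper, which avoids resampling each group separately:
# count packets per (ipdst, proto) in 5-second bins in one pass
data_group2 = (pcap_data.groupby(['ipdst', 'proto', pd.Grouper(freq='5S')])
                        .size()
                        .reset_index(name='count'))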
in dask:
meta = pd.DataFrame(columns=['no', 'ipsrc', 'info'], dtype=object,
                    index=pd.MultiIndex(levels=[[], [], []], codes=[[], [], []],
                                        names=['ipdst', 'proto', 'time']))
df = (df.set_index('time')
        .groupby(['ipdst', 'proto'])
        .apply(lambda x: x.resample('5S').sum(), meta=meta))
df = df.reset_index()
Hope it works for you.