I have a CSV that is generated in a format I cannot change. The file has a multi-index header. The file looks like this.
The end goal is to turn the top row (hours) into an index, and index it with the "ID" column, so that the data looks like this.
I have imported the file into pandas...
myfile = 'c:/temp/myfile.csv'
df = pd.read_csv(myfile, header=[0, 1], tupleize_cols=True)
pd.set_option('display.multi_sparse', False)
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['hour', 'field'])
df
But that gives me three unnamed fields:
My final step is to stack on hour:
df.stack(level=['hour'])
But I am missing what comes before that: how to index the other columns, even though there is a blank multi-index line above them.
I believe the lines you are missing may be # 3 and 4:
df = pd.io.parsers.read_csv('temp.csv', header = [0,1], tupleize_cols = True)
df.columns = [c for _, c in df.columns[:3]] + [c for c in df.columns[3:]]
df = df.set_index(list(df.columns[:3]), append = True)
df.columns = pd.MultiIndex.from_tuples(df.columns, names = ['hour', 'field'])
Convert the tuples to strings by dropping the first value for first 3 col. headers.
Shelter these headers by placing them in an index.
After you perform the stack, you may reset the index if you like.
e.g.
Before
(Unnamed: 0_level_0, Date) (Unnamed: 1_level_0, id) \
0 3/11/2016 5
1 3/11/2016 6
(Unnamed: 2_level_0, zone) (100, p1) (100, p2) (200, p1) (200, p2)
0 abc 0.678 0.787 0.337 0.979
1 abc 0.953 0.559 0.776 0.520
After
field p1 p2
Date id zone hour
0 3/11/2016 5 abc 100 0.678 0.787
200 0.337 0.979
1 3/11/2016 6 abc 100 0.953 0.559
200 0.776 0.520
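The steps above can be sketched end to end on an in-memory file. This is a minimal sketch, not the answerer's exact code: the CSV text is an assumption reconstructed from the "Before" output, and it omits `tupleize_cols`, which newer pandas versions no longer accept. The first three columns are moved into the index before the remaining headers are named and stacked.

```python
import io

import pandas as pd

# Assumed file layout, reconstructed from the "Before" output above
csv_text = (
    ",,,100,100,200,200\n"
    "Date,id,zone,p1,p2,p1,p2\n"
    "3/11/2016,5,abc,0.678,0.787,0.337,0.979\n"
    "3/11/2016,6,abc,0.953,0.559,0.776,0.520\n"
)

# Two header rows become a two-level column index
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Shelter the first three columns in the index, keeping only their
# second-level names (Date, id, zone)
id_cols = list(df.columns[:3])
df = df.set_index(id_cols)
df.index.names = [field for _, field in id_cols]

# Rebuild the remaining headers as a clean (hour, field) MultiIndex
df.columns = pd.MultiIndex.from_tuples(list(df.columns), names=["hour", "field"])

# Move the hour level from the columns into the row index
stacked = df.stack(level="hour")
print(stacked)
```

Note that the hour labels are read as strings here; `stacked.reset_index()` turns the index levels back into ordinary columns if that is preferred.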
Related
I am new to core Python. I have working code that I need to convert into a method.
I have around 50k rows with 30 columns. Three of those columns matter for this requirement: Id, Code, and bill_id. I need to populate a new column, "multiple_instance", with 0s and 1s, so the final dataframe will contain 50k rows with 31 columns. The 'Code' column contains many different codes, so I filter on the codes I am interested in and apply the remaining logic to those rows.
I need to pass these 3 columns in a method() which would return 0s and 1s.
Note: multiple_instance_codes is a variable which can be changed later.
multiple_instance_codes = ['A','B','C','D']
filt = df['Code'].str.contains('|'.join(multiple_instance_codes ), na=False,case=False)
df_mul = df[filt]
df_temp = df_mul.groupby(['Id'])[['Code']].size().reset_index(name='count')
df_mul = df_mul.merge(df_temp, on='Id', how='left')
df_mul['Cumulative_Sum'] = df_mul.groupby(['bill_id'])['count'].cumsum()
df_mul['multiple_instance'] = np.where(df_mul['Cumulative_Sum'] > 1, 1, 0)
**Sample data :**
bill_id Id Code Cumulative_Sum multiple_instance
10 1 B 1 0
10 2 A 2 1
10 3 C 3 1
10 4 A 4 1
Never mind, it is completed and working fine.
def multiple_instance(df):
    df_scored = df.copy()
    # keep only the rows whose Code matches one of the codes of interest
    filt = df_scored['Code'].str.contains('|'.join(multiple_instance_codes), na=False, case=False)
    df1 = df_scored[filt]
    # number of matching rows per Id
    df_temp = df1.groupby(['Id'])[['Code']].size().reset_index(name='count')
    df1 = df1.merge(df_temp, on='Id', how='left')
    # running total of those counts within each bill
    df1['Cum_sum'] = df1.groupby(['bill_id'])['count'].cumsum()
    df_scored = df_scored.merge(df1)
    df_scored['multiple_instance'] = np.where(df_scored['Cum_sum'] > 1, 1, 0)
    return df_scored
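A variant that keeps every input row (rows whose code is not in the list simply get 0) might look like the sketch below. The name `flag_multiple_instance` and the `codes` parameter are hypothetical, and the sketch assumes the dataframe index is unique and that the codes contain no regex special characters.

```python
import pandas as pd


def flag_multiple_instance(df, codes):
    """Return a copy of df with a 0/1 'multiple_instance' column (hypothetical helper)."""
    # Rows whose Code contains any of the codes of interest (case-insensitive)
    mask = df["Code"].str.contains("|".join(codes), na=False, case=False)
    matched = df[mask]

    # Number of matching rows per Id, broadcast back onto each matching row
    per_id = matched.groupby("Id")["Code"].transform("size")

    # Running total of those counts within each bill
    running = per_id.groupby(matched["bill_id"]).cumsum()

    out = df.copy()
    # Non-matching rows are absent from `running`, so they are filled with False
    out["multiple_instance"] = (running > 1).reindex(df.index, fill_value=False).astype(int)
    return out


sample = pd.DataFrame({
    "bill_id": [10, 10, 10, 10, 10],
    "Id": [1, 2, 3, 4, 5],
    "Code": ["B", "A", "C", "A", "Z"],
})
result = flag_multiple_instance(sample, ["A", "B", "C", "D"])
print(result)
```

Passing the code list as an argument, rather than reading a global, keeps the function usable when `multiple_instance_codes` changes later.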
I'm trying to achieve a dataframe transformation (kinda complicated for me) with Pandas, see image below.
The original dataframe source is an Excel sheet (here is an example) that looks exactly like the input here :
INPUT → OUTPUT (made by draw.io)
Basically, I need to do these transformations in order:
Select (in each block) the first four lines + the last two lines
Stack all the blocks together
Drop the last three unnamed columns
Select columns A and E
Fill down the column A
Create a new column N1 that holds a sequence of values (ID-01 to ID-06)
Create a new column N2 that concatenates the first value of the block and its number
For that, I wrote this code, which unfortunately returns a [0 rows × 56 columns] dataframe:
import pandas as pd
myFile = r"C:\Users\wasp_96b\Desktop\ExcelSheet.xlsx"
df1 = pd.read_excel(myFile, sheet_name = 'Sheet1')
df2 = (pd.wide_to_long(df1.reset_index(), 'A' ,i='index',j='value').reset_index(drop=True))
df2.ffill(axis = 0)
df2.insert(2, 'N1', 'ID-' + str(range(1, 1 + len(df2))))
df2.insert(3, 'N2', len(df2)//5)
display(df2)
Do you have any idea or explanation for this behavior?
Are there any other ways I can obtain the result I'm looking for?
The column names in your code and in the data do not match. However, from the data and the output you want, I think I can answer your question. The code is very specific to the data you provided, and you might need to change it later.
CODE
import pandas as pd
myFile = "ExcelSheet.xlsx"
df = pd.read_excel(myFile, sheet_name='Sheet1')
# Forward filling the column
df["Housing"] = df["Housing"].ffill()
# Select the first 4 lines and last two lines
df = pd.concat([df.head(4), df.tail(2)]).reset_index(drop=True)
# Drop the unnecessary columns
df = df.drop(columns=[col for col in df.columns if not (col.startswith("Elements") or col == "Housing")])
df.rename(columns={"Elements": "Elements.0"}, inplace=True)
# Stack all columns
df = pd.wide_to_long(df.reset_index(), stubnames=["Elements."], i="index", j="N2").reset_index("N2")
df.rename(columns={"Elements.": "Elements"}, inplace=True)
# Adding N1 and N2
df["N1"] = "ID_" + (df.index + 1).astype("str")
df["N2"] = df["Housing"] + "-" + (df["N2"] + 1).astype("str")
# Finishing up
df = df[["Housing", "Elements", "N1", "N2"]].reset_index(drop=True)
print(df.head(12))
OUTPUT(only first 12 rows)
Housing Elements N1 N2
0 OID1 1 ID_1 OID1-1
1 OID1 M-0368 ID_2 OID1-1
2 OID1 JUM ID_3 OID1-1
3 OID1 NODE-1 ID_4 OID1-1
4 OID4 BTM-B ID_5 OID4-1
5 OID4 1 ID_6 OID4-1
6 OID1 1 ID_1 OID1-2
7 OID1 M-0379 ID_2 OID1-2
8 OID1 JUM ID_3 OID1-2
9 OID1 NODE-2 ID_4 OID1-2
10 OID4 BTM-B ID_5 OID4-2
11 OID4 2 ID_6 OID4-2
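If the zero-padded labels from the question (ID-01 to ID-06) are wanted instead of ID_1 to ID_6, a repeating sequence can be built from the row position. This is a minimal sketch; `add_block_ids` and `block_size` are hypothetical names, and it assumes every block has the same number of rows.

```python
import pandas as pd


def add_block_ids(df, block_size=6):
    """Add an N1 column cycling through ID-01 .. ID-<block_size> (hypothetical helper)."""
    # Position of each row inside its block, starting at 1
    position = pd.Series(range(len(df)), index=df.index) % block_size + 1
    # Zero-pad to two digits and prepend the prefix
    labels = "ID-" + position.astype(str).str.zfill(2)
    return df.assign(N1=labels)


demo = pd.DataFrame({"Elements": list("abcdefghijkl")})
labelled = add_block_ids(demo)
print(labelled)
```

This also sidesteps the problem in the question's code, where `str(range(...))` produces the single string "range(1, n)" rather than one label per row.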
I have a dictionary that I am using to rename columns in a dataframe like so:
column_names = {name1:rename1, name2:rename2}
new_df = df[[k for k in column_names.keys()]]
new_df.rename(columns=column_names, inplace=True)
When a new field comes into the dictionary, I get this error at this line in the code:
code: new_df = df[[k for k in column_names.keys()]]
issue: KeyError: "['new_col'] not in index"
How do I make the list comprehension flexible when renaming, so that if a new value is present in the df or the dictionary, that column is included and assigned a value of zero?
I tried creating a try/except block, but I am not sure how to proceed after this point:
try:
    new_df = df[[k for k in column_names.keys()]]
    new_df.rename(columns=column_names, inplace=True)
except KeyError:
    # assign the failed column back to the original df (named df) with a value
    # of zero
    # rerun all steps in the try block
    pass
Try changing this line of code
new_df = df[[k for k in column_names.keys() if k in df.columns]]
IIUC, you can divide the column_names into 2 parts: entries that are also in the columns of df and others. Then index & rename with the first part and assign with the second part:
# get the non-existent ones
to_assign = column_names.keys() - df.columns
# drop them
[column_names.pop(key) for key in to_assign]
# subset the `df` with what remained
ndf = df[column_names.keys()]
# rename them
ndf = ndf.rename(columns=column_names)
# assign the other ones (with zeros)
ndf = ndf.assign(**dict.fromkeys(to_assign, 0))
E.g.,
In []: df
Out[]:
L_1 D_1 L_2 D_2
0 1.0 7 NaN NaN
1 1.0 12 1-1 play
2 NaN -1 1-1 play
3 1.0 9 1-1 play
In []: column_names = {"L_1": "M_1", "L_4": "Z_5", "D_1": "E_1"}
In []: ndf = above_operations...
In []: ndf
Out[]:
M_1 E_1 L_4
0 1.0 7 0
1 1.0 12 0
2 NaN -1 0
3 1.0 9 0
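A different technique that reaches a similar result in one pass is `DataFrame.reindex` with `fill_value`, sketched below. The helper name `select_and_rename` is hypothetical. Unlike the steps above, the missing column is also renamed (it comes out as Z_5 rather than L_4), and the dictionary is not modified.

```python
import numpy as np
import pandas as pd


def select_and_rename(df, column_names, fill_value=0):
    """Pick the dictionary's keys from df, creating absent ones, then rename (hypothetical helper)."""
    # reindex keeps existing columns and creates the missing ones filled with fill_value
    picked = df.reindex(columns=list(column_names), fill_value=fill_value)
    return picked.rename(columns=column_names)


df = pd.DataFrame({
    "L_1": [1.0, 1.0, np.nan, 1.0],
    "D_1": [7, 12, -1, 9],
    "L_2": [np.nan, "1-1", "1-1", "1-1"],
    "D_2": [np.nan, "play", "play", "play"],
})
column_names = {"L_1": "M_1", "L_4": "Z_5", "D_1": "E_1"}
ndf = select_and_rename(df, column_names)
print(ndf)
```

Existing NaN values are left alone; `fill_value` only applies to columns that `reindex` has to create.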
I have a dataframe that has the user id in one column and a string consisting of comma-separated values of item ids for the items he possesses in the second column. I have to convert this into a resulting dataframe that has user ids as indices, and unique item ids as columns, with value 1 when that user has the item, and 0 when the user does not have the item. Attached below is the gist of the problem and the approach I am currently using to solve this problem.
temp = pd.DataFrame([[100, '10, 20, 30'],[200, '20, 30, 40']], columns=['userid','listofitemids'])
print(temp)
temp.listofitemids = temp.listofitemids.apply(lambda x:set(x.split(', ')))
dat = temp.values
df = pd.DataFrame(data = [[1]*len(dat[0][1])], index = [dat[0][0]], columns=dat[0][1])
for i in range(1, len(dat)):
    t = pd.DataFrame(data = [[1]*len(dat[i][1])], index = [dat[i][0]], columns=dat[i][1])
    df = df.append(t, sort=False)
df.head()
However, this code is clearly inefficient, and I am looking for a faster solution to this problem.
Let us try str.split with explode then crosstab
s = temp.assign(listofitemids=temp['listofitemids'].str.split(', ')).explode('listofitemids')
s = pd.crosstab(s['userid'], s['listofitemids']).mask(lambda x : x.eq(0))
s
Out[266]:
listofitemids 10 20 30 40
userid
100 1.0 1 1 NaN
200 NaN 1 1 1.0
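If plain 0/1 values are wanted, as the question describes, rather than NaN for missing items, `Series.str.get_dummies` is another option. This is a minimal sketch of that swapped-in technique; it assumes the items are always separated by exactly ", ".

```python
import pandas as pd

temp = pd.DataFrame(
    [[100, "10, 20, 30"], [200, "20, 30, 40"]],
    columns=["userid", "listofitemids"],
)

# Split each string on ", " and build one indicator column per distinct item
wide = temp.set_index("userid")["listofitemids"].str.get_dummies(sep=", ")
print(wide)
```

The resulting column labels are strings ('10', '20', ...), since they come from splitting the original text.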
I have imported data from a csv file into my program and then used set_index to set 'rule_id' as index. I used this code:
df = pd.read_excel('stack.xlsx')
df = df.set_index('rule_id')
and the data looks like this:
Now I want to compare each column with the one before it, in reverse order. For example, I want to compare the 'c' data with 'b', then 'b' with 'a', and so on, and create another column after each comparison that contains the index of the column where the value was zero. If both columns have the value 0, Null should be written to the new column, and if both values are non-zero, Null should be written as well.
The result should look like this:
I am not sure how to approach this problem in code. If you could help me, that would be great.
Edit: a minor edit. I imported the data from an Excel file that looks like this (this is just part of the data; there are more columns):
Then I used pivot_table to manipulate the data as per my requirement using this code:
df = df.pivot_table(index = 'rule_id' , columns = ['date'], values = 'rid_fc', fill_value = 0)
and my data looks like this now:
Now I want to compare each column with the one before it, in reverse order. For example, I want to compare the '2019-04-25 16:36:32' data with '2019-04-25 16:29:05', then '2019-04-25 16:29:05' with '2019-04-25 16:14:14', and so on, and create another column after each comparison that contains the index of the column where the value was zero. If both columns have the value 0, Null should be written to the new column, and if both values are non-zero, Null should be written as well.
IIUC you can try with:
d={i:e for e,i in enumerate(df.columns)}
m1=df[['c','b']]
m2=df[['b','a']]
df['comp1']=m1.eq(0).dot(m1.columns).map(d)
m3=m2.eq(0).dot(m2.columns)
m3.loc[m3.str.len()!=1]=np.nan
df['comp2']=m3.map(d)
print(df)
a b c comp1 comp2
rule_id
51234 0 7 6 NaN 0.0
53219 0 0 1 1.0 NaN
56195 0 2 2 NaN 0.0
I suggest using numpy: compare the shifted values with logical_and, build the column positions with np.arange in reversed order, and set the new columns with numpy.where and the DataFrame constructor:
df = pd.DataFrame({
'a':[0,0,0],
'b':[7,0,2],
'c':[6,1,2],
})
# reverse the column order of the array
x = df.values[:, ::-1]
# True where the earlier column is 0 and the later column is not 0
a = np.logical_and(x[:, 1:] == 0, x[:, :-1] != 0)
# column positions, counting down to 0
b = np.arange(a.shape[1]-1, -1, -1)
# new column names
c = [f'comp{i+1}' for i in range(x.shape[1] - 1)]
# use the column position where the mask is True, NaN elsewhere
df1 = pd.DataFrame(np.where(a, b[None, :], np.nan), columns=c, index=df.index)
print (df1)
comp1 comp2
0 NaN 0.0
1 1.0 NaN
2 NaN 0.0
You can make use of this code snippet. I did not have time to perfect it with loops etc. so please make the change as per requirements.
import pandas as pd
import numpy as np
# Data
print(df.head())
a b c
0 0 7 6
1 0 0 1
2 0 2 2
cp = df.copy()
cp[cp != 0] = 1
cp['comp1'] = cp['a'] + cp['b']
cp['comp2'] = cp['b'] + cp['c']
# Logic
cp = cp.replace([0, 1, 2], [1, np.nan, 0])
cp[['a', 'b', 'c']] = df[['a', 'b', 'c']]
# Results
print(cp.head())
a b c comp1 comp2
0 0 7 6 NaN 0.0
1 0 0 1 1.0 NaN
2 0 2 2 NaN 0.0
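The rule from the question can also be written as an explicit loop over adjacent column pairs, which works for any number of columns. This is a minimal sketch; `zero_position_per_pair` is a hypothetical name, and it reads the rule literally: when exactly one column of a pair is zero, record that column's position, otherwise record NaN.

```python
import numpy as np
import pandas as pd


def zero_position_per_pair(df):
    """Compare adjacent columns from right to left (hypothetical helper)."""
    cols = list(df.columns)
    out = {}
    # Walk the pairs right to left: (c, b), then (b, a), and so on
    for n, right in enumerate(range(len(cols) - 1, 0, -1), start=1):
        left = right - 1
        left_zero = df[cols[left]].eq(0)
        right_zero = df[cols[right]].eq(0)
        # Exactly one zero: store that column's position; otherwise NaN
        out[f"comp{n}"] = np.where(
            left_zero & ~right_zero, left,
            np.where(right_zero & ~left_zero, right, np.nan),
        )
    return pd.DataFrame(out, index=df.index)


df = pd.DataFrame({"a": [0, 0, 0], "b": [7, 0, 2], "c": [6, 1, 2]})
comps = zero_position_per_pair(df)
print(df.join(comps))
```

For the pivoted data, the same function can be called on the pivot table directly, since it only relies on column order and not on the column names.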