I have a CSV file and I am trying to split a row into multiple rows if it contains more than 4 columns.
Example: (input shown as an image in the original post)
Expected output: (shown as an image in the original post)
Is there a way to do that in pandas or Python? Sorry if this is a simple question.
When a CSV file has two columns with the same name, pandas automatically appends an integer suffix to the duplicate column names when reading it. For example, the CSV file from the question (shown as an image in the original post) becomes a dataframe with suffixed duplicates after:
df = pd.read_csv("Book1.csv")
df
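For illustration (the original images are unavailable, so the column names here are assumed from the rest of the post), a header that repeats x_center and y_center comes back from read_csv with .1 suffixes on the duplicates:

import io
import pandas as pd

csv_text = "id,x_center,y_center,x_center,y_center\n1_1,31.51,22.61,31.38,22.49"
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.columns.tolist())
# ['id', 'x_center', 'y_center', 'x_center.1', 'y_center.1']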
Now, to solve your question, let's consider the above dataframe as the input dataframe. Try this:
cols = df.columns.tolist()
cols.remove('id')
start = 0
end = 4
new_df = []
final_cols = ['id', 'x1', 'y1', 'x2', 'y2']
while start < len(cols):
    if end > len(cols):
        end = len(cols)
    temp = cols[start:end]                      # next chunk of up to 4 columns
    start = end
    end = end + 4
    temp_df = df.loc[:, ['id'] + temp]
    temp_df.columns = final_cols[:1 + len(temp)]
    if len(temp) < 4:                           # pad a short final chunk
        temp_df[final_cols[1 + len(temp):]] = None
    print(temp_df)
    new_df.append(temp_df)
pd.concat(new_df).reset_index(drop=True)
Result: (output shown as an image in the original post)
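As an aside (not part of the original answers), when the number of data columns is an exact multiple of 4, a plain numpy reshape produces the same stacking more compactly; this sketch assumes the id column comes first:

import numpy as np

n_chunks = (len(df.columns) - 1) // 4
vals = df.drop(columns='id').to_numpy().reshape(-1, 4)  # each row splits into groups of 4
ids = np.repeat(df['id'].to_numpy(), n_chunks)          # one id per group, in matching order
out = pd.DataFrame(vals, columns=['x1', 'y1', 'x2', 'y2'])
out.insert(0, 'id', ids)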
You can first set the video column as the index, then concat each remaining group of 4 columns into a new dataframe. Finally, reset the index to get the video column back.
df.set_index('video', inplace=True)
dfs = []
for i in range(len(df.columns) // 4):
    d = df.iloc[:, range(i*4, i*4+4)]
    dfs.append(d.set_axis(['x_center', 'y_center'] * 2, axis=1))
df_ = pd.concat(dfs).reset_index()
I think the following list comprehension should work too; the positional indexing error I hit with it originally came from a missing comma after the colon in df.iloc[:, ...], which is fixed below:
df_ = pd.concat([df.iloc[:, range(i*4, i*4+4)].set_axis(['x_center', 'y_center'] * 2, axis=1)
                 for i in range(len(df.columns) // 4)])
print(df_)
video x_center y_center x_center y_center
0 1_1 31.510973 22.610222 31.383655 22.488293
1 1_1 31.856295 22.830109 32.016905 22.948702
2 1_1 32.011684 22.990689 31.933356 23.004779
I want to produce a txt file from the CSV file, formatted as follows:
CSV file: (shown as an image in the original post)
Desired output txt file (header italicized):
MA Am1 Am2 Am3 Am4
MX1 X Y - -
MX2 9 10 11 12
Any suggestions on how to do this? Thank you!
I need help writing the Python code to achieve this. I've tried looping through every row, but I'm still struggling to find a way to write it.
You can try this:
Based on the unique MA value groups (the name column here), collect each group's values into a list.
Create a new dataframe with those lists.
Expand each values list into columns and add them to the new dataframe.
Copy the name column from the first dataframe.
Reorder the columns so that name comes first.
Code:
import pandas as pd

df = pd.DataFrame([['MX1', 1, 222], ['MX1', 2, 222], ['MX2', 4, 44],
                   ['MX2', 3, 222], ['MX2', 5, 222]],
                  columns=['name', 'values', 'etc'])
df_new = pd.DataFrame(columns=['name', 'values'])
for group in df.groupby('name'):
    df_new.loc[-1] = [group[0], group[1]['values'].to_list()]  # prepend this group's row
    df_new.index = df_new.index + 1
    df_new = df_new.sort_index()
df_expanded = pd.DataFrame(df_new['values'].values.tolist()).add_prefix('Am')
df_expanded['name'] = df_new['name']
cols = df_expanded.columns.tolist()
cols = cols[-1:] + cols[:-1]                   # move name to the front
df_expanded = df_expanded[cols]
print(df_expanded.fillna('-'))
Output:
name Am0 Am1 Am2
0 MX2 4 3 5.0
1 MX1 1 2 -
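A shorter sketch of the same idea (assuming the same df as above): groupby/agg collects each group's values into a list, and apply(pd.Series) expands the lists into columns:

out = (df.groupby('name')['values'].agg(list)  # one list of values per name
         .apply(pd.Series)                     # expand each list into columns 0, 1, ...
         .add_prefix('Am')
         .reset_index())
print(out.fillna('-'))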
I have code which iterates through an Excel file and extracts column values, which are loaded as lists in a dataframe. When I write the dataframe to Excel, I see the data wrapped in [] with quotes for strings ['']. How can I remove the [''] when I write to Excel?
Also, I want to write only the first value in the Product_ID column to Excel. How can I do that?
result = pd.DataFrame.from_dict(result) # result has list of data
df_t = result.T
writer = pd.ExcelWriter(path)
df_t.to_excel(writer, 'data')
writer.save()
My output to Excel: (shown as an image in the original post)
I am expecting output as below, where the Product_ID column has only the first value of each list: (shown as an image in the original post)
I tried the code below and I'm getting an error:
path = "path to excel"  # placeholder for the real file path
df = pd.read_excel(path, engine="openpyxl")

def data_clean(x):
    for index, data in enumerate(x.values):
        item = eval(data)
        if len(item):
            x.values[index] = item[0]
        else:
            x.values[index] = ""
    return x

new_df = df.apply(data_clean, axis=1)
new_df.to_excel(path)
I am getting the error below:
item = eval(data)
TypeError: eval() arg 1 must be a string, bytes or code object
df_t['id'] = df_t['id'].str[0]  # a shortcut for when you only want the 0th element
df_t['other_columns'] = df_t['other_columns'].apply(lambda x: " ".join(x))  # "unlist" the lists you have fed into a pandas column
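One caveat (an aside, not from the original answer): .str[0] indexes whatever object sits in each cell, so it returns the first list element only while the cells still hold actual lists; once the values have been written out and read back as strings like "['AF124', 'AC12414']", it would return the first character instead:

import pandas as pd

print(pd.Series([['AF124', 'AC12414']]).str[0])    # -> AF124 (cell holds a list)
print(pd.Series(["['AF124', 'AC12414']"]).str[0])  # -> [ (cell holds a string)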
This should give the effect you want, but you have to make sure the data in each cell has the form ['', ...]; if it differs, you can adjust how it is handled in the data_clean function:
import pandas as pd

df = pd.read_excel("1.xlsx", engine="openpyxl")

def data_clean(x):
    for index, data in enumerate(x.values):
        item = eval(data)          # turn the cell's string representation back into a list
        if len(item):
            x.values[index] = item[0]
        else:
            x.values[index] = ""
    return x

new_df = df.apply(data_clean, axis=1)
new_df.to_excel("new.xlsx")
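A safety note on this approach: eval executes whatever expression it is given, so it will run arbitrary Python found in a cell. Since the cells here are just string representations of lists, ast.literal_eval from the standard library parses them without executing code; a minimal variant of the same function, under that assumption:

import ast

def data_clean(x):
    for index, data in enumerate(x.values):
        item = ast.literal_eval(data)              # safely parses "['Allen']" into a list
        x.values[index] = item[0] if len(item) else ""
    return x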
The following is an example of df and the modified new_df (some randomly generated data):
# df
name Product_ID xxx yyy
0 ['Allen'] ['AF124', 'AC12414'] [124124] [222]
1 ['Aaszflen'] ['DF124', 'AC12415'] [234125] [22124124,124125]
2 ['Allen'] ['CF1sdv24', 'AC12416'] [123544126] [33542124124,124126]
3 ['Azdxven'] ['BF124', 'AC12417'] [35127] [333]
4 ['Allen'] ['MF124', 'AC12418'] [3528] [12352324124,124128]
5 ['Allen'] ['AF124', 'AC12419'] [122359] [12352324124,124129]
# new_df
name Product_ID xxx yyy
0 Allen AF124 124124 222
1 Aaszflen DF124 234125 22124124
2 Allen CF1sdv24 123544126 33542124124
3 Azdxven BF124 35127 333
4 Allen MF124 3528 12352324124
5 Allen AF124 122359 12352324124
I need to insert rows based on the week column within each type group (from groupby). In some cases weeks are missing in the middle of the dataframe at different positions, and I want to insert rows to fill in the missing weeks as copies of the last existing row: in this case, copies of week 7 to fill in weeks 8 and 9, and copies of week 11 to fill in weeks 12, 13 and 14. On this table you can see the jump from week 7 to 10 and from 11 to 15: (table shown as an image in the original post)
The perfect output would be as follows: the final table with incremental values in the week column, done the correct way: (shown as an image in the original post)
Below is the code I have; it inserts only one row, and I'm confused why:
import pandas as pd
from pandas import DataFrame

def middle_values(final: DataFrame) -> DataFrame:
    finaltemp = pd.DataFrame()
    out = pd.DataFrame()
    for i in range(0, len(final)):
        for f in range(1, 52, 1):
            if final.iat[i, 8] == f and final.iat[i-1, 8] != f-1:
                if final.iat[i, 8] > final.iat[i-1, 8] and final.iat[i, 8] != (final.iat[i-1, 8] - 1):
                    line = final.iloc[i-1]
                    c1 = final[0:i]
                    c2 = final[i:]
                    c1.loc[i] = line
                    concatinated = pd.concat([c1, c2])
                    concatinated.reset_index(inplace=True)
                    concatinated.iat[i, 11] = concatinated.iat[i-1, 11]
                    concatinated.iat[i, 9] = f-1
                    finaltemp = finaltemp.append(concatinated)
    if 'type' in finaltemp.columns:
        for name, groups in finaltemp.groupby(["type"]):
            weeks = range(groups['week'].min(), groups['week'].max() + 1)
            out = out.append(pd.merge(finaltemp, pd.Series(weeks, name='week'), how='right').ffill())
        out.drop_duplicates(subset=['project', 'week'], keep='first', inplace=True)
        out.drop_duplicates(inplace=True)
        out.sort_values(["Budget: Budget Name", "Budget Week"], ascending=(False, True), inplace=True)
        out.drop(['level_0'], axis=1, inplace=True)
        out.reset_index(inplace=True)
        out.drop(['level_0'], axis=1, inplace=True)
        return out
    else:
        return final
For the first part of your question, suppose we have a dataframe like the following:
df = pd.DataFrame({"project": [1, 1, 1, 2, 2, 2], "week": [1, 3, 4, 1, 2, 4], "value": [12, 22, 18, 17, 18, 23]})
We can create a new MultiIndex to get the additional rows that we need:
import numpy as np

new_index = pd.MultiIndex.from_arrays(
    [sorted([i for i in df['project'].unique()] * 52),
     [i for i in np.arange(1, 53, 1)] * df['project'].unique().shape[0]],
    names=['project', 'week'])
We can then apply this index to get the new dataframe that you need, with blanks in the new rows:
df = df.set_index(['project', 'week']).reindex(new_index).reset_index().sort_values(['project', 'week'])
You would then need to apply a forward fill (using ffill) or a back fill (using bfill) with groupby and transform to get the required values in the rows that you need.
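A minimal sketch of that last step, assuming a value column and the reindexed df from above:

df['value'] = df.groupby('project')['value'].ffill()  # forward-fill within each project
# or, to also fill gaps at the start of a group:
df['value'] = df.groupby('project')['value'].transform(lambda s: s.ffill().bfill())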
I am trying to have pandas compare two dataframes with each other. In dataframe 1 I have two columns (AC-Cat and Origin). I am trying to compare the AC-Cat column with the contents of dataframe 2. If a match is found between one of the columns of dataframe 2 and the value of dataframe 1 being studied, I want pandas to copy the header of the dataframe 2 column in which the match was found into a new column in dataframe 1.
DF1:
f = {'AC-Cat': pd.Series(['B737', 'A320', 'MD11']),
'Origin': pd.Series(['AJD', 'JFK', 'LRO'])}
Flight_df = pd.DataFrame(f)
DF2:
w = {'CAT-C': pd.Series(['DC85', 'IL76', 'MD11', 'TU22', 'TU95']),
'CAT-D': pd.Series(['A320', 'A321', 'AN12', 'B736', 'B737'])}
WCat_df = pd.DataFrame(w)
I imported pandas as pd and numpy as np and tried to define a function to compare these columns.
def get_wake_cat(AC_cat):
    try:
        Wcat = [WCat_df.columns.values[0]][WCat_df.iloc[:, 1] == AC_cat].values[0]
    except:
        Wcat = np.NAN
    return Wcat

Flight_df.loc[:, 'CAT'] = Flight_df.loc[:, 'AC-Cat'].apply(lambda CT: get_wake_cat(CT))
However, the function does not produce the desired output. For example, take the B737 AC-Cat value: I want pandas to find this value in DF2, in the column CAT-D, and to copy this header to the new column of DF1. This does not happen. Can someone help me find out why my code is not giving the desired results?
Not pretty but I think I got it working. Part of the error was that the function did not have WCat_df. I also changed the indexing into two steps:
def get_wake_cat(AC_cat, WCat_df):
    try:
        d = WCat_df[WCat_df.columns.values][WCat_df.iloc[:] == AC_cat]
        Wcat = d.columns[(d == AC_cat).any()][0]
    except:
        Wcat = np.NAN
    return Wcat
Then you need to change your next line to:
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT,WCat_df ))
AC-Cat Origin CAT
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
Hope that solves the problem.
This will give you two new columns with the name(s) of the match(es) found:
Flight_df['CAT1'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-C' if x in list(WCat_df['CAT-C']) else '')
Flight_df['CAT2'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-D' if x in list(WCat_df['CAT-D']) else '')
Flight_df.loc[Flight_df['CAT1'] == '', 'CAT1'] = Flight_df['CAT2']
Flight_df.loc[Flight_df['CAT1'] == Flight_df['CAT2'], 'CAT2'] = ''
IIUC, you can do a stack and merge:
final = (Flight_df.merge(WCat_df.stack().reset_index(1, name='AC-Cat'), on='AC-Cat', how='left')
                  .rename(columns={'level_1': 'New'}))
print(final)
Or with melt:
final = Flight_df.merge(WCat_df.melt(var_name='New', value_name='AC-Cat'),
                        on='AC-Cat', how='left')
AC-Cat Origin New
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
I have a pandas dataframe with two columns: the first with a single date (action_date) and the second with a list of dates (verification_date). I am trying to calculate the time difference between the date in action_date and each of the dates in the list in the corresponding verification_date column, and then fill two new df columns with the counts of dates in verification_date whose difference is over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()
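# (Aside: pd.TimeGrouper has since been removed from pandas; in current versions
# the same grouping is written with pd.Grouper, e.g.
# df = df.groupby(pd.Grouper(freq='2D'))['verification_date'].apply(list).reset_index())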
def make_columns(df):
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df

make_columns(df)
This kind of works, EXCEPT that the df ends up with the same values in every row, which is not right since the dates differ. For example, in the first row of the dataframe there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. However, it is empty, and instead the under_360 column is populated with 1, which is accurate only for the second row of action_date.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation, with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want to do instead is set the value for each row's calculation individually. You can do this by replacing the above lines with these:
df.set_value(i, 'over_360', len(over_360))
df.set_value(i, 'under_360', len(under_360))
This sets a value at row i in column over_360 or under_360. You can learn more about it in the pandas documentation for set_value.
If you don't like using set_value you can also use this:
df.ix[i, 'over_360'] = len(over_360)
df.ix[i, 'under_360'] = len(under_360)
You can read about DataFrame.ix in the pandas documentation.
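(Both set_value and .ix have since been removed from pandas; assuming a current version, the equivalent scalar assignments use .at or .loc:)

df.at[i, 'over_360'] = len(over_360)    # .at sets a single cell by label
df.at[i, 'under_360'] = len(under_360)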
You might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.