Unable to concat the output for multiple rows - Python

I have a dataframe like the one below.
If I write code like below:
df.iloc[0]
And if I write code like below:
df.iloc[3]
I want to concat df.iloc[0], df.iloc[1], df.iloc[2] and so on, up to however many rows are present. But I'm unable to do it with a for loop. Can anyone help me with this?

Use concat with a comprehension:
df1 = pd.concat((df.loc[i] for i in df.index))
Or:
df1 = pd.concat((df.iloc[i] for i in range(len(df.index))))
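For a concrete picture, here is a minimal runnable sketch with a made-up two-column dataframe (the data is invented; each df.iloc[i] is a Series, and concat stacks them end to end):
import pandas as pd

# Invented example data
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Each df.iloc[i] is a Series indexed by the column labels,
# so concatenating them yields one long Series
df1 = pd.concat(df.iloc[i] for i in range(len(df.index)))
print(df1)
# A    1
# B    3
# A    2
# B    4
# dtype: int64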

Unmerge cells when using groupby (PANDAS)

I grouped some data using groupby:
df1['N° of options'] = df.groupby(['Country','Restaurant']).Food.size()
The result is a dataframe where the grouping columns appear merged; instead, I'd like those values repeated in every cell down the column.
Any clue about how I can display the data like this?
For now, I get something like this:
Thank you!!
Assuming that grouped_df is your grouped dataframe, you can use pandas.DataFrame.reset_index to turn the two index levels back into ordinary columns, repeating their values down the rows.
>>> print(grouped_df)
>>> print(grouped_df.reset_index())
Another way to do it is to add the as_index=False argument to your groupby clause:
grouped_df = df.groupby(['SG_UF', 'SG_PARTIDO'], as_index=False).sum()
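Since the question's output tables aren't shown here, a small sketch with invented values, using the question's column names, of what reset_index produces:
import pandas as pd

# Made-up data shaped like the question's
df = pd.DataFrame({
    "Country": ["BR", "BR", "US"],
    "Restaurant": ["R1", "R1", "R2"],
    "Food": ["a", "b", "c"],
})

# reset_index turns the two group levels back into ordinary columns,
# so Country and Restaurant are repeated on every row
flat = df.groupby(["Country", "Restaurant"]).Food.size().reset_index(name="N° of options")
print(flat)
#   Country Restaurant  N° of options
# 0      BR         R1              2
# 1      US         R2              1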
If I understand correctly, you are trying to sort rather than group, since you mention wanting to see the values.
sort works like df_name.sort_values(by=column_name, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
In your code, it could look like:
df.sort_values(by=['Country', 'Restaurant']). Use other arguments as required, such as the sort order.

Pandas .stack() issue

I have been trying to use pandas to do a simple stack and it seems I am missing something.
I have a csv file in this format
I thought I would use stack to get this
The number of columns and the number of items will vary.
df = pd.read_csv("z-textsource.csv")
data_stacked = df.stack()
data_stacked.to_csv("z-textsource_stacked.csv")
However, when I run the code I get this
Many thanks in advance!
The item column is not the index here. Please try:
df = pd.read_csv("z-textsource.csv", index_col=0)
And then run the same code you already use.
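To illustrate the difference, a minimal sketch with an invented csv (read from a string here so it is self-contained; the first column is assumed to be named item):
import pandas as pd
from io import StringIO

# Stand-in for z-textsource.csv
csv_text = "item,col1,col2\nfoo,1,2\nbar,3,4\n"

# With index_col=0 the item column becomes the index,
# so stack() labels every stacked value with its item
df = pd.read_csv(StringIO(csv_text), index_col=0)
data_stacked = df.stack()
print(data_stacked)
# item
# foo   col1    1
#       col2    2
# bar   col1    3
#       col2    4
# dtype: int64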

Filling a dataframe with multiple dataframe values

I have some 100 dataframes that need to be filled into another big dataframe. Presenting the question with two dataframes:
import pandas as pd
df1 = pd.DataFrame([1,1,1,1,1], columns=["A"])
df2 = pd.DataFrame([2,2,2,2,2], columns=["A"])
Please note that both dataframes have the same column names.
I have a master dataframe that has repetitive index values, as follows:
master_df=pd.DataFrame(index=df1.index)
master_df= pd.concat([master_df]*2)
Expected output:
master_df['A']=[1,1,1,1,1,2,2,2,2,2]
I am using a for loop to replace every n rows of master_df with df1, df2, ..., df100.
Please suggest a better way of doing it.
In fact, df1, df2, ..., df100 are the output of a function whose input is the column A value (1, 2, ...). I was wondering if there is something like
another_df=master_df['A'].apply(lambda x: function(x))
Thanks in advance.
If you want to concatenate the dataframes, you can just use pandas concat with a list, as the code below shows.
First, add df1 and df2 to a list:
df_list = [df1, df2]
Then you can concat the dfs:
master_df = pd.concat(df_list)
I used the default value of 0 for 'axis' in the concat function (which is what I think you are looking for), but if you want to concatenate the different dfs side by side you can just set axis=1.
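Under the question's setup this reproduces the expected output, once ignore_index=True is used so the repeated 0..4 index is renumbered (a small sketch):
import pandas as pd

df1 = pd.DataFrame([1, 1, 1, 1, 1], columns=["A"])
df2 = pd.DataFrame([2, 2, 2, 2, 2], columns=["A"])

# ignore_index=True renumbers the rows 0..9 instead of repeating 0..4 twice
master_df = pd.concat([df1, df2], ignore_index=True)
print(master_df["A"].tolist())
# [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]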

Adding column/s if not existing using Pandas

I am using Pandas with PsychoPy to reorder my results in a dataframe. The problem is that the dataframe varies according to the participant's performance. However, I would like to have a common dataframe, where non-existing columns are created empty. The columns then have to be in a specific order in the output file.
Let's suppose I have a dataframe from a participant with the following columns:
x = ["Error_1", "Error_2", "Error_3"]
I want the final dataframe to look like this:
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
Where "Error_4" is created as an empty column.
I tried applying something like this (adapted from another question):
if "Error_4" not in x:
x["Error_4"] = ""
x = x[["Error_1", "Error_2", "Error_3", "Error_4"]]
In principle this should work; however, I have roughly 70 other columns for which I would have to do the same, and it doesn't seem practical to repeat it for each of them.
Do you have any suggestions?
I also tried creating a new dataframe with all the possible columns, e.g.:
y = ["Error_1", "Error_2", "Error_3", "Error_4"]
However, it is still not clear to me how to merge the dataframes x and y while skipping columns with the same header.
Use DataFrame.reindex:
x = x.reindex(["Error_1", "Error_2", "Error_3", "Error_4"], axis=1, fill_value='')
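A runnable sketch with a stand-in participant dataframe (values invented):
import pandas as pd

# Hypothetical participant data where Error_4 never occurred
x = pd.DataFrame({"Error_1": [1], "Error_2": [0], "Error_3": [2]})

# reindex adds any missing columns (filled with '') and enforces the order
x = x.reindex(["Error_1", "Error_2", "Error_3", "Error_4"], axis=1, fill_value="")
print(x)
#    Error_1  Error_2  Error_3 Error_4
# 0        1        0        2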
Thanks for the reply, I followed your suggestion and adapted it. I'm posting it here, since it may be useful for someone else.
First I create a dataframe y as I want my output to look like:
y = ["Error_1", "Error_2", "Error_3", "Error_4", "Error_5", "Error_6"]
Then, I get my actual output file df and modify it as df2, adding to it all the columns of y in the exact same order.
df = pd.DataFrame(myData)
columns = df.columns.values.tolist()
df2 = df.reindex(columns = y, fill_value='')
In this case, all the columns that are absent in df but present in y are added to df2.
However, let's suppose that in df there is a column "Error_7" that is absent in y. To keep track of these columns I am just applying merge and creating a new dataframe df3:
df3 = pd.merge(df2,df)
df3.to_csv(filename+'UPDATED.csv')
The missing columns are going to be added at the end of the dataframe.
If you think this procedure might have drawbacks, or if there is another way to do it, let me know :)
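One possible alternative (a sketch, not tested against your data): build the column list as y plus whatever extra columns df has, and reindex once, which avoids the separate merge step:
# Keep y's order first, then append any columns of df not listed in y
all_cols = y + [c for c in df.columns if c not in y]
df3 = df.reindex(columns=all_cols, fill_value="")
df3.to_csv(filename + 'UPDATED.csv')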

Using pd.DataFrame.replace with an apply function as the replace value

I have several dataframes with some columns containing dates in the ASP.NET format "/Date(1239018869048)/". I've figured out how to parse this into Python's datetime format for a given column. However, I would like to put this logic into a function so that I can pass it any dataframe and have it replace all dates matching a regex, using pd.DataFrame.replace.
something like:
def pretty_dates(df):
    # Messy logic here
    ...

df.replace(to_replace=r'\/Date\((\d+)\)\/', value=pretty_dates(df), regex=True)
The problem with this is that what gets passed to pretty_dates is the whole dataframe, not just the cell that needs to be replaced.
So the concept I'm trying to figure out is whether the value used as the replacement in df.replace can be a function instead of a static value.
Thank you so much in advance
EDIT
To add some clarity: I have many columns in my dataframe, over a hundred, that contain this date format. I would rather not list out every single column that has a date. Is there a way to apply the function that cleans my dates across all the columns in my dataset? I do not want to clean one column but all the hundreds of columns of my dataframe.
I'm sure you can use regex to do this in one step, but here is how to apply it to the whole column at once:
df = pd.Series(['/Date(1239018869048)/',
                '/Date(1239018869048)/'], dtype=str)
df = df.str.replace(r'\/Date\(', '', regex=True)
df = df.str.replace(r'\)\/', '', regex=True)
print(df)
0    1239018869048
1    1239018869048
dtype: object
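Once the wrapper is stripped, the remaining millisecond values can be converted to real datetimes with pd.to_datetime (a short continuation of the series above):
dates = pd.to_datetime(df.astype('int64'), unit='ms')
print(dates)
# 0   2009-04-06 11:54:29.048
# 1   2009-04-06 11:54:29.048
# dtype: datetime64[ns]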
As far as I understand, you need to apply a custom function to selected cells in a specified column. Hope the following example helps you:
import pandas as pd
df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True) # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x+x) # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
    selection = df.loc[:, col].str.contains('t', regex=True)  # put your regexp here
    df.loc[selection, col] = df.loc[selection, col].map(lambda x: x + x)  # do some logic instead
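Putting the two ideas together for the original date question, a sketch that scans every column, converts only the cells matching the ASP.NET pattern, and leaves everything else untouched (the column names here are invented):
import pandas as pd

# Invented dataframe: one date-bearing column, one ordinary column
df = pd.DataFrame({
    "created": ["/Date(1239018869048)/", "/Date(1239018869048)/"],
    "name": ["a", "b"],
})

pattern = r"/Date\((\d+)\)/"

for col in df.columns:
    # Only touch cells that actually match the date pattern
    mask = df[col].astype(str).str.fullmatch(pattern)
    if mask.any():
        ms = df.loc[mask, col].str.extract(pattern)[0].astype("int64")
        df.loc[mask, col] = pd.to_datetime(ms, unit="ms")

print(df)
#                    created name
# 0  2009-04-06 11:54:29.048    a
# 1  2009-04-06 11:54:29.048    b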
