I have two dataframes. input_df1 is a concatenation of files with a 'filename' column, a 'unitid' column, and many columns of data, which I simplify to a single 'data' column for this example. input_df2 has a 'filename' column in addition to 'group'. I'm trying to replicate the rows of input_df1 per filename for each instance of that filename in input_df2, and add a 'group' column to the output dataframe showing which group each row belongs to.
import pandas as pd
input_df1 = pd.DataFrame(data={
'filename' : ['A', 'A', 'B', 'C'],
'unitid' : [ 1, 2, 3, 4 ],
'data' : [11, 12, 13, 14 ]
})
input_df2 = pd.DataFrame(data={
'filename' : ['A', 'B', 'C', 'C', 'A' ],
'group' : ['g1', 'g2', 'g3', 'g4', 'g5']
})
output_df = pd.DataFrame(data={
'filename' : ['A', 'A', 'B', 'C', 'C', 'A', 'A'],
'unitid' : [ 1, 2, 3, 4, 4, 1, 2],
'data' : [11, 12, 13, 14, 14, 11, 12],
'group' : ['g1', 'g1', 'g2', 'g3', 'g4', 'g5', 'g5']
})
output_df is what I'm trying to create: the rows of input_df1 replicated once per instance of their filename in input_df2, with the matching 'group' value added (so A's two rows appear under both g1 and g5).
Another question I have: if I need to filter the rows of each replicated dataframe based on the group type, is it better to do that before or after joining? I was planning to filter afterwards since I have a better idea of how to do that, but computing on unneeded rows is inefficient when they could be dropped during the replication step. Also, I'm dealing with about 30k rows in input_df1 and 800 rows in input_df2, so there is potential for 24M rows in total.
Any direction on which functions I should research to achieve this would be very appreciated.
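A minimal sketch of one approach, using pandas.merge (a many-to-many inner join on 'filename' produces exactly this replication; the column names follow your example above):

import pandas as pd

# Each row of input_df2 picks up every input_df1 row with the same
# 'filename', so blocks of input_df1 are replicated once per instance
# and the 'group' value is carried along.
output_df = input_df2.merge(input_df1, on='filename')
output_df = output_df[['filename', 'unitid', 'data', 'group']]  # match column order
print(output_df)

On the filtering question: if the filter depends only on 'group' (or only on columns already in one input), dropping the unneeded rows before the merge avoids materialising rows you would immediately discard, which matters at the 30k x 800 scale you describe; filtering afterwards gives the same result, just with more intermediate memory.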
So here is my issue:
# creation of the dataframe
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
If I run the code below, the values of the 'priority' column change for the dataframe dg (okay, good), but for the dataframe df too. Why?
# map function
dg = df
dg["priority"] = dg["priority"].map({"yes":True, "no":False})
print(dg)
Simply use df.copy().
Because df is a DataFrame instance, and therefore an object, assigning it to another variable just makes two names point at the same object; pandas does not create a new one.
That's because pandas dataframes are mutable.
pandas.DataFrame
Two-dimensional, size-mutable, potentially
heterogeneous tabular data.
You want pandas.DataFrame.copy to keep the original dataframe (in your case df) unchanged.
# map function
dg = df.copy()
dg["priority"] = dg["priority"].map({"yes":True, "no":False})
When I searched for a way to remove an entire row in pandas if it contains a null/NaN value, the only appropriate function I found was dropna(). For some reason, it's not removing the entire row as intended, but instead seems to leave the null values in place. As I want to discard those rows and then take the mean age of the animals in the dataframe, I need a way to not count the NaN values.
Here's the code:
import numpy as np
import pandas as pd
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df.dropna(inplace= True)
df.head()
In this case, I need to delete dog 'd' and cat 'h', but that is not the output I get.
Note that I have also tried this, and it didn't work either:
df2 = df.dropna()
You have to specify axis=1 and how='any' to remove columns.
See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
df.dropna(axis=1, inplace=True, how='any')
If you just want to delete the rows:
df.dropna(inplace=True, how='any')
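If the end goal is the mean age, a sketch starting from the original df in the question: the subset parameter drops only the rows whose 'age' is missing, and note that Series.mean() already skips NaN by default.

cleaned = df.dropna(subset=['age'])   # drops only rows 'd' and 'h'
print(cleaned.index.tolist())         # ['a', 'b', 'c', 'e', 'f', 'g', 'i', 'j']
print(df['age'].mean())               # 3.4375 -- NaN values are excluded automatically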
I create my dataframe and pivot as follows:
df = pd.DataFrame({
'country': ['US','US','US','US','US','UK','UK','UK','UK','UK','IT','IT','IT','IT','IT'],
'dimension': ['a', 'b', 'c', 'd','e','a', 'b', 'c', 'd','e','a', 'b', 'c', 'd','e'],
'count': [29, 10, 9, 34,29, 10, 9, 34,29, 10, 9, 34,15,17,18],
})
Pivot:
pivoted = df.pivot_table(index='country', columns='dimension', values='count')
Now I want to get the values of the first row of the pivot table as a list. How can I do that?
The output as a list should be: [9, 34, 15, 17, 18]
I tried iloc but did not succeed.
print(pivoted.iloc[0,:].tolist()) # [9, 34, 15, 17, 18]
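For reference, two equivalent ways to pull that row out (the first row is 'IT' because pivot_table sorts the index alphabetically):

print(pivoted.loc['IT'].tolist())   # select the row by its index label
print(pivoted.values[0].tolist())   # via the underlying NumPy array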
I'm a new python user familiar with R.
I want to calculate user-defined quantiles for groups complete with the count of observations in each group.
In R I would do:
df_sum <- df %>% group_by(group) %>%
dplyr::summarise(q85 = quantile(obsval, probs = 0.85, type = 8),
n = n())
In python I can get the grouped percentile by:
df_sum = df.groupby(['group'])['obsval'].quantile(0.85)
How do I add the group count to this?
I have tried:
df_sum = df.groupby(['group'])['obsval'].describe(percentile=[0.85])[[count]]
df_sum = df.groupby(['group'])['obsval'].quantile(0.85).describe(['count'])
Example data:
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df
Expected result:
group percentile count
A 7.4 5
B 6.55 4
You can use pandas.DataFrame.agg() to apply multiple functions.
In this case you should use numpy.quantile().
import pandas as pd
import numpy as np
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df_sum = df.groupby(['group'])['obsval'].agg([lambda x: np.quantile(x, q=0.85), "count"])
df_sum.columns = ['percentile', 'count']
print(df_sum)
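As an alternative sketch, named aggregation (available since pandas 0.25) sets the output column names directly, so no rename step is needed. One caveat: the default interpolation in numpy.quantile and Series.quantile matches R's type 7, not the type=8 you used in R, so the values can differ slightly.

df_sum = df.groupby('group')['obsval'].agg(
    percentile=lambda x: x.quantile(0.85),
    count='count',
)
print(df_sum)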
Consider the following Python dictionary data and Python list labels:
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
Create a DataFrame birds from this dictionary data, using labels as the index, with pandas.
Assuming your dictionary is already ordered to match the labels:
import numpy as np
import pandas as pd
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
data['labels'] = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, columns=['birds', 'age', 'visits', 'priority', 'labels'])
df = df.set_index('labels')
Try the code below:
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = {
'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
'labels' : labels
}
df = pd.DataFrame.from_dict(data)
df = df.set_index('labels')
You can reduce the code a little: DataFrame gives us the flexibility to pass values such as data, columns, and index when constructing it.
If we pass a dictionary, by default the dictionary's keys are treated as columns and its values as the rows.
In the following code I set the index name through the DataFrame's index.name attribute:
df = pd.DataFrame(data, index=labels)  # custom index
df.index.name = 'labels'  # the index name is None until you set it
I hope this will be helpful for you.
I encountered the same exact issue a few days back, and there is a very nice library for handling dataframes that has a lot to offer beyond pandas.
Search for turicreate in Python; it is very similar to pandas but has a lot more to offer.
You can define SFrames in Turi Create, somewhat similar to the pandas dataframe. After that you just have to run:
dataframe_name.show()
.show() visualizes any data structure in Turi Create.
You can visit the mentioned notebook for a better understanding: https://colab.research.google.com/drive/1DIFmRjGYx0UOiZtvMi4lOZmaBMnu_VlD
You can try this out:
import pandas as pd
import numpy as np
from pandas import DataFrame
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = DataFrame(data, index=labels)
print(df)