Sum DF columns stored in dictionary [closed] - python

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I have a dictionary with 3 DataFrames: {0: DataFrame, 1: DataFrame, 2: DataFrame}.
Each DataFrame has the same size: 6 variables, 25 rows.
I'd like to sum all the values in the 'Income' column of each DataFrame and collect the sums in a list.
It should look like this:
list_of_sums = [Sum of income DF0, Sum of income DF1, Sum of income DF2]

Try this:
list_of_sums = []
input_dict = {0: DataFrame, 1: DataFrame, 2: DataFrame}  # placeholder for your dictionary of DataFrames
for key in input_dict:
    # sum the 'Income' column and append the result to the output list
    list_of_sums.append(input_dict[key]['Income'].sum())
print(list_of_sums)

Not the most optimized solution, but this works.
import pandas as pd
d, dd, ddd = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
d['Income'] = [3, 4, 5, 6]
dd['Income'] = [30, 40, 50, 60]
ddd['Income'] = [300, 400, 500, 600]
dicts = [d, dd, ddd]  # a plain list of DataFrames, despite the name
for dc in dicts:
    # built-in sum over the column values; prints one total per DataFrame
    print(sum([s for s in dc['Income']]))
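The same idea can be written as a single list comprehension. This is only a minimal sketch: it assumes every DataFrame in the dictionary has an 'Income' column, and the sample data and the name frames are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the three DataFrames in the question.
frames = {
    0: pd.DataFrame({'Income': [3, 4, 5, 6]}),
    1: pd.DataFrame({'Income': [30, 40, 50, 60]}),
    2: pd.DataFrame({'Income': [300, 400, 500, 600]}),
}

# Sum the 'Income' column of each DataFrame, visiting the keys in sorted order.
list_of_sums = [int(frames[key]['Income'].sum()) for key in sorted(frames)]
print(list_of_sums)  # [18, 180, 1800]
```

Sorting the keys makes the order of the sums explicit instead of relying on the dictionary's insertion order.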


How can I generate a random dataset in approximately equal proportions [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
How can I generate a random dataset with 8 columns and 50,000 rows using Python? Each column must be a categorical variable (e.g. breed of dog) with 3 levels (e.g. colour), occurring in approximately equal proportions. How many unique rows (i.e., permutations of category levels) are possible?
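import numpy as np
# the shape must be passed as a tuple; dtype=object lets the array hold string labels
data = np.empty((50000, 8), dtype=object)
data[:, 0] = [np.random.choice(col0_options) for _ in range(50000)]
# do this for all columns ...
data[:, 7] = [np.random.choice(col7_options) for _ in range(50000)]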
If you want to make it less hardcoded you could store every list of options in a dict like {0: col0_options, ..., 7: col7_options} and then do:
options = {0: col0_options, ..., 7: col7_options}
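data = np.empty((50000, 8), dtype=object)  # shape as a tuple, object dtype for labels
for i in range(8):
    # use _ for the inner loop so it does not shadow the column index i
    data[:, i] = [np.random.choice(options[i]) for _ in range(50000)]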
Since there are 8 columns and 3 values per column, there are 3^8 = 6561 unique possible rows.
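For reference, a self-contained sketch of the approach above. The level names (col0_level0 and so on) are invented placeholders, and the seeded np.random.default_rng generator is a choice made here so the run is reproducible; the answer itself uses np.random.choice.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is reproducible
n_rows, n_cols = 50_000, 8

# Hypothetical level names: 3 levels per column.
options = {i: [f"col{i}_level{j}" for j in range(3)] for i in range(n_cols)}

data = np.empty((n_rows, n_cols), dtype=object)
for col, levels in options.items():
    # Uniform sampling gives each level a share of roughly 1/3.
    data[:, col] = rng.choice(levels, size=n_rows)

# Observed share of each level per column (levels sorted alphabetically).
proportions = {
    col: np.unique(data[:, col].astype(str), return_counts=True)[1] / n_rows
    for col in options
}

max_unique_rows = 3 ** n_cols                         # 6561 possible combinations
observed_unique_rows = len({tuple(row) for row in data})
print(max_unique_rows, observed_unique_rows)
```

The number of distinct rows actually observed can never exceed 3^8, and with 50,000 random draws it may fall slightly short of it.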

How to efficiently calculate co-existence of elements in pandas [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I have a dataframe in which each row shows one transaction, i.e. the items purchased together. Here is what my dataframe looks like:
items
['A','B','C']
['A','C']
['C','F']
...
I need to create a dictionary that shows how many times items have been purchased together, something like this:
{'A':[('B',1),('C',5)], 'B': [('A',1),('C',6)], ...}
Right now, I have defined a variable freq, and I loop through my dataframe to calculate and update the dictionary (freq), but it takes a very long time.
What is an efficient way of calculating this without looping through the dataframe?
You can speed this up with sklearn's MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
Transform your data using:
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df['items']),
                  columns=mlb.classes_,
                  index=df.index)
to get it in the following format:
   A  B  C  F
0  1  1  1  0
1  1  0  1  0
2  0  0  1  1
You can then define a trivial helper function:
get_num_coexisting = lambda x, y: (df[x] & df[y]).sum()
And use it like so:
get_num_coexisting('A', 'C')
>>> 2
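To get the whole dictionary the question asks for, rather than one pair at a time, the one-hot table can be multiplied by its own transpose. The sketch below is not part of the answer above: it builds the one-hot table with pandas only (explode plus crosstab) instead of MultiLabelBinarizer, uses the three example rows from the question as data, and the names onehot, cooc and freq are chosen here for illustration.

```python
import pandas as pd

# The three example transactions from the question.
df = pd.DataFrame({'items': [['A', 'B', 'C'], ['A', 'C'], ['C', 'F']]})

# One row per (transaction, item) pair, then a transaction-by-item 0/1 table.
exploded = df['items'].explode()
onehot = pd.crosstab(exploded.index, exploded).clip(upper=1)

# Item-by-item matrix: entry (x, y) counts transactions containing both x and y.
cooc = onehot.T.dot(onehot)

# Convert to the dictionary layout from the question, skipping self-pairs and zeros.
freq = {
    item: [(other, int(n)) for other, n in cooc[item].items()
           if other != item and n > 0]
    for item in cooc.columns
}
print(freq)
```

The matrix product replaces the explicit Python loop over rows, which is usually where the time goes in the looping version.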

User entry as column names in pandas [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I have the user enter two sets of values, one for sizes and one for minutes, and each is stored in a list. For example, they can input sizes 111, 121 and minutes 5, 10, 15.
I want the dataframe to have columns named by size and minute. (I wrote a for loop to extract each size and minute.) For example, I want the columns to say 111,5; 111,10; 111,15; etc. I tried df[size+minute]=values (values is the data I want to put into each column), but the column name is just the two numbers added up, so I got a column named 116 instead of 111,5.
If you have two lists:
l = [111,121]
l2 = [5,10,15]
Then you can use list comprehension to form your column names:
col_names = [str(x)+';'+str(y) for x in l for y in l2]
print(col_names)
['111;5', '111;10', '111;15', '121;5', '121;10', '121;15']
And create a dataframe with these column names using pandas.DataFrame:
df = pd.DataFrame(columns=col_names)
If we add a row of data:
row = pd.DataFrame([[1,2,3,4,5,6]], columns=col_names)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, row], ignore_index=True)
We can see that our dataframe looks like this:
print(df)
  111;5 111;10 111;15 121;5 121;10 121;15
0     1      2      3     4      5      6
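The column name in the question came out as 116 because size+minute adds two integers; converting both to strings first gives the intended label. A minimal sketch of the loop described in the question, using a comma as the separator as the asker wanted; the values list is a made-up placeholder for the real data.

```python
import pandas as pd

sizes = [111, 121]
minutes = [5, 10, 15]
values = [1, 2, 3]  # hypothetical column data

df = pd.DataFrame()
for size in sizes:
    for minute in minutes:
        # An f-string builds the label "111,5" instead of the sum 116.
        df[f"{size},{minute}"] = values

print(list(df.columns))
# ['111,5', '111,10', '111,15', '121,5', '121,10', '121,15']
```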

Pandas new column with calculation based on other existing column [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have a pandas DataFrame and want to do a calculation based on an existing column.
However, the apply function is not working for some reason.
Let's say it is something like
df = pd.DataFrame({'Age': age, 'Input': input})
and the Input column is something like [1.10001, 1.49999, 1.60001].
Now I want to add a new column to the DataFrame that does the following:
Add 0.0001 to each element in the column
Multiply each value by 10
Convert each value of the new column to int
Use Series.add, Series.mul and Series.astype:
# input is a Python builtin, so it is better not to use it as a variable name
inp = [1.10001, 1.49999, 1.60001]
age = [10,20,30]
df = pd.DataFrame({'Age': age, 'Input': inp})
df['new'] = df['Input'].add(0.0001).mul(10).astype(int)
print(df)
   Age    Input  new
0   10  1.10001   11
1   20  1.49999   15
2   30  1.60001   16
You could also write a simple function and apply it row by row.
def f(row):
    # the column is named 'Input' (capital I)
    return int((row['Input'] + 0.0001) * 10)
df['new'] = df.apply(f, axis=1)
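As a quick sanity check, the vectorized chain and the row-wise apply can be run side by side on the sample data. This sketch assumes the column is named 'Input' as in the question; the column names new_vectorized and new_apply are made up here.

```python
import pandas as pd

df = pd.DataFrame({'Age': [10, 20, 30],
                   'Input': [1.10001, 1.49999, 1.60001]})

# Vectorized: add, multiply, then truncate to int.
df['new_vectorized'] = df['Input'].add(0.0001).mul(10).astype(int)

# Row-wise: the same arithmetic applied to one row at a time.
def f(row):
    return int((row['Input'] + 0.0001) * 10)

df['new_apply'] = df.apply(f, axis=1)

print(df['new_vectorized'].tolist())  # [11, 15, 16]
```

Both columns hold the same values; the vectorized form is generally the faster of the two on large frames because it avoids a Python-level call per row.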

Python :Select the rows for the most recent entry from multiple users [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I have a dataframe df with 3 columns:
df = pd.DataFrame({
    'User': ['A','A','B','A','C','B','C'],
    'Values': ['x','y','z','p','q','r','s'],
    'Date': [14,11,14,12,13,10,14]
})
I want to create a new dataframe containing, for each user, the row with the highest value in the 'Date' column. (The desired dataframe was shown as an image in the original post.)
Can anyone help me with this problem?
This answer assumes that each user has a single row with the maximum value in the Date column (if several rows tie for the maximum, all of them are returned):
In [10]: def get_max(group):
    ...:     return group[group.Date == group.Date.max()]
    ...:

In [12]: df.groupby('User').apply(get_max).reset_index(drop=True)
Out[12]:
   Date User Values
0    14    A      x
1    14    B      z
2    14    C      s
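An alternative sketch, not taken from the answer above, looks up the index label of each user's maximum Date with idxmax and selects those rows directly. Note that idxmax returns only the first occurrence when several rows tie, so it always yields exactly one row per user; the name latest is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'User': ['A', 'A', 'B', 'A', 'C', 'B', 'C'],
    'Values': ['x', 'y', 'z', 'p', 'q', 'r', 's'],
    'Date': [14, 11, 14, 12, 13, 10, 14],
})

# Index label of the row with the largest Date within each User group.
latest_idx = df.groupby('User')['Date'].idxmax()

# Select those rows and renumber them from 0.
latest = df.loc[latest_idx].reset_index(drop=True)
print(latest)
```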
