pandas groupby column name part? [closed]

How can I group a dataframe by the part of the column name after the underscore, and then plot the pairs with one command?

import pandas as pd
import numpy as np

dataframe = pd.DataFrame(np.random.randn(5, 5),
                         columns=['2678_namex', '2354_namey', '2396_namex', '2398_namez', '2368_namey'])

The result should be the following groups:
[2678_namex, 2396_namex]
[2354_namey, 2368_namey]

Are you looking for something like this?

df.columns = list(map(lambda x: x.split('_')[1], df.columns))
df.T.groupby(by=df.columns).sum()
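If the goal is to also plot each pair with one command, a minimal sketch could look like the following (matplotlib is an assumption here, since the question does not name a plotting library):

import matplotlib.pyplot as plt

# Group the original column labels by the part after the underscore.
suffixes = dataframe.columns.str.split('_').str[1]
for suffix, cols in dataframe.columns.to_series().groupby(suffixes):
    # Plot all columns sharing a suffix (e.g. namex) in one figure.
    dataframe[list(cols)].plot(title=suffix)
plt.show()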

Related

Python : for loop for multi variables [closed]

It seems a simple question, but I'm new to Python.
I have 10 variables (named A to J); they are float32 np.arrays. I want to apply the following command:

variable = variable*mask[0,:,:]; variable[variable==0] = np.nan

to all of the variables in one line rather than writing 10 lines, while keeping the variable names the same.
Pseudocode example:

FOR all variables A-J
    variable = variable*mask[0,:,:]; variable[variable==0] = np.nan
ENDFOR
You can do something like this:

variables = [a, b, c]
for i in range(len(variables)):
    x = variables[i] * mask[0, :, :]
    x[x == 0] = np.nan
    variables[i] = x

Note: this only updates the items in the list, not the original names a, b, c.
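Since the question wants to keep the names A to J, a dictionary keyed by name is a more idiomatic sketch; the array shapes and the mask below are made-up assumptions:

import numpy as np

# Hypothetical setup: ten float32 arrays named 'A'..'J' plus a mask.
arrays = {name: np.random.rand(3, 4).astype(np.float32) for name in 'ABCDEFGHIJ'}
mask = np.ones((1, 3, 4), dtype=np.float32)

for name, arr in arrays.items():
    out = arr * mask[0, :, :]
    out[out == 0] = np.nan   # float32 can represent NaN
    arrays[name] = out       # each name still maps to its own array

# Afterwards the data is reachable by name, e.g. arrays['A']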

python pandas groupby detail and sum [closed]

Here is my target: I want to show a groupby sum() with pandas, but I can only print either the sum() or the detail rows. How can I combine the two outputs?
Pic 1 is my code, pic 2 is what I expect.
[pic 1 and pic 2: screenshots of the code and the expected output]
First group by 'area', then append a total row to each group with a function:

def get_total(x):
    x.loc['total'] = ['', '', x['amount'].sum()]
    return x

df.groupby(['area']).apply(get_total)
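A runnable sketch with made-up data (the real column names are only visible in the screenshots, so 'area', 'item' and 'amount' are assumptions):

import pandas as pd

df = pd.DataFrame({'area':   ['north', 'north', 'south'],
                   'item':   ['a', 'b', 'c'],
                   'amount': [10, 20, 30]})

def get_total(x):
    # Append a 'total' row; empty strings fill the non-numeric columns.
    x.loc['total'] = ['', '', x['amount'].sum()]
    return x

print(df.groupby(['area']).apply(get_total))

This prints each group's detail rows followed by its total row.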

Pandas Error- None of Index are in the columns [closed]

Problem Statement
I have CSV data as shown in the image. From it, I have to keep only RegionName, State, and the quarterly mean values from 2000-2016. I also want a multi-index of [State, RegionName].
I am working on the CSV file with pandas in Python, as shown in the screenshot.
Right before the troublesome for year in range(...) loop, you did:

house_data.columns = pd.to_datetime(house_data.columns).to_period('M')

That means your columns are no longer strings. So inside the for loop:

house_data[str(year)+'q2'] = house_data[[str(year)+'-04',...]].mean(axis=1)

would fail and throw that error, since there are no columns with string names. To fix this, do this instead:

house_data.columns = pd.to_datetime(house_data.columns).to_period('M').strftime('%Y-%m')

However, you are better off doing:

house_data.columns = pd.to_datetime(house_data.columns).to_period('Q')
house_data.groupby(level=0, axis=1).mean()
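For reference, a sketch of the whole flow, under the assumption that after setting the multi-index only monthly value columns remain (the file name is hypothetical; newer pandas deprecates groupby(axis=1), so this groups the transpose instead):

import pandas as pd

house_data = pd.read_csv('homes.csv')   # hypothetical file name

# Multi-index on [State, RegionName], as the question asks.
house_data = house_data.set_index(['State', 'RegionName'])

# Turn month labels such as '2000-01' into quarters, then average per quarter.
house_data.columns = pd.to_datetime(house_data.columns).to_period('Q')
quarterly = house_data.T.groupby(level=0).mean().T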

How to assign names to a dataframe in Pyspark [closed]

At every step, should I introduce a new variable name, or can I continue to use the same name? Please advise what the best practice is and why.

df1 = df.withColumn('last_insert_timestamp', lit(datetime.now()))
df2 = df1.withColumn('process_date', lit(rundate))

versus

df = df.withColumn('last_insert_timestamp', lit(datetime.now()))
df = df.withColumn('process_date', lit(rundate))
There is no single best practice here; it depends on what you want to do.
In Python, variables are just labels bound to objects. So if you want your original DataFrame name to refer to the modified data throughout your code, rebind the same name to the newly generated DataFrame.
If, on the other hand, you need to keep the first DataFrame for other processing later in the code, assign the new DataFrame to a new name.
You might find more explanation here: Reassigning Variables in Python
You can also chain the calls:

df = df.withColumn('last_insert_timestamp', lit(datetime.now())) \
       .withColumn('process_date', lit(rundate))
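For completeness, a self-contained sketch (the sample data and the rundate value are assumptions):

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
rundate = '2023-01-01'   # hypothetical run date

# Each withColumn returns a new DataFrame, so chaining avoids intermediate names.
df = (df.withColumn('last_insert_timestamp', lit(datetime.now()))
        .withColumn('process_date', lit(rundate)))
df.show()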

How to reduce the time complexity of this nested loop code in python [closed]

Please help me reduce the time complexity of this nested loop in Python.
df is a dataframe with, say, 3 columns: name, city, and date.
rep is a dataframe with the averages/means based on the two columns name and city from df. I need to reattach the mean from rep to df.
k = 0
for i in range(0, len(rep)):
    for j in range(k, len(df)):
        if df["X"][j] == rep["X"][i]:
            df["Mean"][j] = rep["Mean"][i]
        else:
            k = j
            break
What you want is something like:

df.set_index('X').join(rep.set_index('X'))

Setting the join keys as the index makes the process much faster. After the join, you can drop the old mean column (with the DataFrame drop method) and filter out any values you don't want.
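A runnable sketch (the column names name and city come from the question; the data and the 'Mean' column name are assumptions):

import pandas as pd

# df holds the detail rows; rep holds one precomputed mean per (name, city) pair.
df = pd.DataFrame({'name': ['a', 'a', 'b'],
                   'city': ['x', 'y', 'x'],
                   'date': ['2021-01', '2021-02', '2021-01']})
rep = pd.DataFrame({'name': ['a', 'a', 'b'],
                    'city': ['x', 'y', 'x'],
                    'Mean': [1.5, 2.0, 3.0]})

# A single vectorized merge replaces the O(len(rep) * len(df)) nested loops.
df = df.merge(rep, on=['name', 'city'], how='left')
print(df)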
