Pandas: updating values based on value from run time [duplicate] - python

This question already has answers here:
Replace Column in Data Frame from Lookup of other Data Frame
(2 answers)
Closed 4 years ago.
I have a dataframe like this:
df1:
Steam feat
1 some_value
2 some_value
3 some_value
4 some_value
I have to update the value in "feat" based on a certain condition. For example,
I have to set feat to "88" when Steam is "2".
The output should look like this:
final output:
Steam feat
1 some_value
2 88
3 some_value
4 some_value
The issue is that the values "2" and "88" have to be supplied at run time, taken from a different table called df2.
df2:
cola colb
2 88
To achieve this, I tried to apply the below code:
df1.loc[df1["Steam"] = df2.cola.values, 'feat'] = df2.colb.values
However, I am getting an "invalid syntax" error.
The value of df2.cola.values looks like this:
array(['2'], dtype=object)
Am I doing anything wrong here? Please advise.

You need to align indices and map your data. This is one way, which should be efficient if you expect a mapping to exist.
df1['feat'] = df1['Steam'].map(df2.set_index('cola')['colb']).fillna(df1['feat'])
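The "invalid syntax" error in the question most likely comes from using `=` (assignment) inside the mask where `==` (comparison) is needed. A minimal sketch of both the mapping approach and the corrected `.loc` attempt follows. It assumes the keys are strings, as the `dtype=object` output suggests; the sample frames and the name `alt` are made up for illustration.

```python
import pandas as pd

# Sample data modelled on the question; "some_value" is a placeholder.
df1 = pd.DataFrame({"Steam": ["1", "2", "3", "4"],
                    "feat": ["some_value"] * 4})
df2 = pd.DataFrame({"cola": ["2"], "colb": ["88"]})

# Keep an untouched copy for the second variant below.
alt = df1.copy()

# Variant 1: build a lookup Series (index = cola, values = colb).
lookup = df2.set_index("cola")["colb"]

# Rows without a match become NaN and fall back to the existing feat value.
df1["feat"] = df1["Steam"].map(lookup).fillna(df1["feat"])
print(df1)

# Variant 2: the original attempt with `==`, for a single-row df2.
alt.loc[alt["Steam"] == df2["cola"].iloc[0], "feat"] = df2["colb"].iloc[0]
```

The mapping variant scales to many rows in df2, as long as the values in `cola` are unique; the `.loc` variant only handles one key at a time.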

Related

How to filter pandas dataframe based on length of a list in a column? [duplicate]

This question already has answers here:
How to filter a pandas dataframe based on the length of a entry
(2 answers)
Closed 1 year ago.
I have a pandas DataFrame like this:
id subjects
1 [math, history]
2 [English, Dutch, Physics]
3 [Music]
How to filter this dataframe based on the length of the column subjects?
So for example, if I only want to have rows where len(subjects) >= 2?
I tried using
df[len(df["subjects"]) >= 2]
But this gives
KeyError: True
Using loc does not help either; it gives me the same error.
Thanks in advance!
Use the string accessor to work with lists:
df[df['subjects'].str.len() >= 2]
Output:
   id                   subjects
0   1            [math, history]
1   2  [English, Dutch, Physics]
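The `KeyError: True` in the question arises because `len(df["subjects"])` is the number of rows (3 here), so `3 >= 2` collapses to a single `True`, and `df[True]` then looks for a column named `True`. A minimal sketch of the per-row length filter, using the data from the question (the names `lengths`, `filtered` and `same` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "subjects": [["math", "history"],
                 ["English", "Dutch", "Physics"],
                 ["Music"]],
})

# Element-wise length of each list: one integer per row.
lengths = df["subjects"].str.len()

# Boolean mask with one entry per row, used to select rows.
filtered = df[lengths >= 2]
print(filtered)

# Equivalent without the string accessor.
same = df[df["subjects"].map(len) >= 2]
```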

How to do arithmetic on a Python DataFrame using instructions held in another DataFrame?

I asked this question for R a few months back and got a great answer that I have used often. Now I'm trying to transition to Python, and I was dreading rewriting this code snippet. After trying, I haven't been able to translate the answer I got (or find anything similar by searching).
The question is: I have a dataframe that I'd like to append new columns to, where the calculation depends on values in another dataframe that holds the instructions.
I have created a reproducible example below (although in reality there are quite a few more columns and many rows so speed is important and I'd like to avoid a loop if possible):
input dataframes:
import pandas as pd
data = {"A":["orange","apple","banana"],"B":[5,3,6],"C":[7,12,4],"D":[5,2,7],"E":[1,18,4]}
data_df = pd.DataFrame(data)
key = {"cols":["A","B","C","D","E"],"include":["no","no","yes","no","yes"],"subtract":["na","A","B","C","D"],"names":["na","G","H","I","J"]}
key_df = pd.DataFrame(key)
desired output (same as data but with 2 new columns):
output = {"A":["orange","apple","banana"],"B":[5,3,6],"C":[7,12,4],"D":[5,2,7],"E":[1,18,4],"H":[2,9,-2],"J":[-4,16,-3]}
output_df= pd.DataFrame(output)
So, the key dataframe has 1 row for each column in the base dataframe and it has an "include" column that has to be set to "yes" if any calculation is to be done. When it is set to "yes", then I want to add a new column with a defined name that subtracts a defined column (all lookups from the key dataframe).
For example, column "C" in the base dataframe is included, so I want to create a new column called "H" which is the value from column "C" minus the value from column "B".
P.S. Here is the answer from R, in case it triggers any thought processes for someone more skilled than me!
k <- subset(key, include == "yes")
output <- cbind(base,setNames(base[k[["cols"]]]-base[k[["subtract"]]],k$names))
Filter for the yes values in include:
yes = key_df.loc[key_df.include.eq("yes"), ["cols", "subtract", "names"]]
  cols subtract names
2    C        B     H
4    E        D     J
Create a dictionary of the yes values and unpack it in the assign method:
yes_values = {name: data_df[col] - data_df[subtract]
              for col, subtract, name
              in yes.to_numpy()}
data_df.assign(**yes_values)
        A  B   C  D   E  H   J
0  orange  5   7  5   1  2  -4
1   apple  3  12  2  18  9  16
2  banana  6   4  7   4 -2  -3
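For comparison, a closer translation of the R one-liner subtracts the two column blocks in one step instead of building a dictionary. This is only a sketch under the question's sample data; the names `k`, `diff`, `new_cols` and `result` are illustrative and not part of the original answer.

```python
import pandas as pd

data_df = pd.DataFrame({"A": ["orange", "apple", "banana"],
                        "B": [5, 3, 6], "C": [7, 12, 4],
                        "D": [5, 2, 7], "E": [1, 18, 4]})
key_df = pd.DataFrame({"cols": ["A", "B", "C", "D", "E"],
                       "include": ["no", "no", "yes", "no", "yes"],
                       "subtract": ["na", "A", "B", "C", "D"],
                       "names": ["na", "G", "H", "I", "J"]})

# Keep only the instructions that are switched on.
k = key_df[key_df["include"] == "yes"]

# Subtract as plain arrays so pandas does not try to align the
# differently named columns (C vs B, E vs D) by label.
diff = (data_df[k["cols"].tolist()].to_numpy()
        - data_df[k["subtract"].tolist()].to_numpy())

# Wrap the result with the requested names and append it to the base frame.
new_cols = pd.DataFrame(diff, columns=k["names"].tolist(), index=data_df.index)
result = pd.concat([data_df, new_cols], axis=1)
print(result)
```

With many included columns this performs a single array subtraction, whereas the dictionary comprehension performs one Series subtraction per instruction row; which is faster in practice would need measuring.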

Can I update the value of a column based on the same column value in a python dataframe?

I have a dataframe that captures characteristics of people accessing a webpage. One of the features I get as input is the list of times spent on the page by each user. I want to replace each list in this column with its maximum value. Is there a way to do this?
Assume that my data is:
df = pd.DataFrame({'Page_id': [1,2,3,4], 'User_count': [5,3,3,6], 'Max_time': [[45,56,78,90,120],[87,109,23],[78,45,89],[103,178,398,121,431,98]]})
What I want to do is convert the column Max_time in df to Max_time: [120, 109, 89, 431]
I am not supposed to add another column for computing the max separately as this table structure cannot be altered.
I tried the following:
for i in range(len(df)):
    df.loc[i]["Max_time"] = max(df.loc[i]["Max_time"])
But this is not changing the column as I intended it to. Is there something that I missed?
df = pd.DataFrame({'Page_id':[1,2,3,4],'User_count':[5,3,3,6],'Max_time':[[45,56,78,90,120],[87,109,23],[78,45,89],[103,178,398,121,431,98]]})
df.Max_time = df.Max_time.apply(max)
Result:
   Page_id  User_count  Max_time
0        1           5       120
1        2           3       109
2        3           3        89
3        4           6       431
You can also use this (it needs import numpy as np):
df['Max_time'] = df['Max_time'].map(lambda x: np.max(x))
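The loop in the question most likely has no effect because `df.loc[i]["Max_time"] = ...` is chained indexing: `df.loc[i]` returns a new Series for that row, and the assignment changes that temporary object rather than `df`. If a loop is really wanted, addressing row and column in one step avoids this. A minimal sketch (the name `looped` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Page_id": [1, 2, 3, 4],
    "User_count": [5, 3, 3, 6],
    "Max_time": [[45, 56, 78, 90, 120], [87, 109, 23],
                 [78, 45, 89], [103, 178, 398, 121, 431, 98]],
})

# Loop variant: row label and column name in a single indexing step.
looped = df.copy()
for i in looped.index:
    looped.at[i, "Max_time"] = max(looped.at[i, "Max_time"])

# Shorter variant, as in the answer: apply max to every list.
df["Max_time"] = df["Max_time"].apply(max)
print(df)
```

Both variants overwrite the existing column, so the table structure stays unchanged.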

Columns in Pandas Dataframe [duplicate]

This question already has answers here:
Binning a column with pandas
(4 answers)
Closed 3 years ago.
I have a dataframe of cars. It has a price column, and I want to create a new column carsrange with values like 'high', 'low', etc. according to the car price. For example:
if the price is between 0 and 9000, carsrange should be 'low' for those cars. Similarly, if the price is between 9000 and 30,000, carsrange should be 'medium' for those cars, and so on. I tried doing it, but my code replaces one value with another. Any help please?
I ran a for loop over the price column and used if-else branches to define my column values.
for i in cars_data['price']:
    if (i>0 and i<9000): cars_data['carsrange']='Low'
    elif (i<9000 and i<18000): cars_data['carsrange']='Medium-Low'
    elif (i<18000 and i>27000): cars_data['carsrange']='Medium'
    elif(i>27000 and i<36000): cars_data['carsrange']='High-Medium'
    else : cars_data['carsrange']='High'
Now, when I run the unique function on carsrange, it shows only 'High'.
cars_data['carsrange'].unique()
This is the Output:
In[74]:cars_data['carsrange'].unique()
Out[74]: array(['High'], dtype=object)
I believe I have applied the wrong concept here. Any ideas as to what I should do now?
You can collect the values in a list and assign it once:
resultList = []
for i in cars_data['price']:
    if (i>0 and i<9000):
        resultList.append("Low")
    # write other conditions here
    else:
        resultList.append("HIGH")
cars_data["carsrange"] = resultList
Then find the unique values from cars_data["carsrange"].
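The loop in the question shows only 'High' because `cars_data['carsrange'] = 'Low'` assigns a scalar to the whole column on every iteration, so the value chosen for the last price wins. The linked duplicate is about binning, and `pd.cut` does that without a loop. The sketch below assumes the bin edges used in the question's code (9000, 18000, 27000, 36000) and uses made-up prices for illustration.

```python
import pandas as pd

# Hypothetical prices, for illustration only.
cars_data = pd.DataFrame({"price": [5000, 12000, 20000, 30000, 45000]})

# Edges taken from the loop in the question; infinity catches everything above.
bins = [0, 9000, 18000, 27000, 36000, float("inf")]
labels = ["Low", "Medium-Low", "Medium", "High-Medium", "High"]

# Intervals are right-closed by default: (0, 9000], (9000, 18000], ...
cars_data["carsrange"] = pd.cut(cars_data["price"], bins=bins, labels=labels)
print(cars_data)
```

The result is a categorical column; prices that fall outside all bins (for example 0 or negative values here) come out as NaN rather than 'High'.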

Merging dataframes together in a for loop [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have a dictionary of pandas dataframes, each frame contains timestamps and market caps corresponding to the timestamps, the keys of which are:
coins = ['dashcoin','litecoin','dogecoin','nxt']
I would like to create a new key 'merged' in the dictionary and, using the pd.merge method, merge the 4 existing dataframes on their timestamp (I want complete rows, so an 'inner' join is appropriate).
Sample of one of the data frames:
data2['nxt'].head()
Out[214]:
timestamp nxt_cap
0 2013-12-04 15091900
1 2013-12-05 14936300
2 2013-12-06 11237100
3 2013-12-07 7031430
4 2013-12-08 6292640
I'm currently getting a result using this code:
data2['merged'] = data2['dogecoin']
for coin in coins:
    data2['merged'] = pd.merge(left=data2['merged'],right=data2[coin], left_on='timestamp', right_on='timestamp')
but this repeats 'dogecoin' in 'merged'. However, if data2['merged'] is not first set to data2['dogecoin'] (or some similar data), the merge function won't work, because 'merged' does not exist yet.
EDIT: my desired result is to create one merged dataframe, stored as a new element in the dictionary data2 (data2['merged']), containing the merged dataframes from the other elements in data2.
Try replacing the module-level pd.merge() with the dataframe's own merge method, but note that you must seed the result with a first dataframe:
data2['merged'] = data2['dashcoin']
# LEAVE OUT FIRST ELEMENT
for coin in coins[1:]:
    data2['merged'] = data2['merged'].merge(data2[coin], on='timestamp')
Since you've already made coins a list, why not just something like
data2['merged'] = data2[coins[0]]
for coin in coins[1:]:
    data2['merged'] = pd.merge(....
Unless I'm misunderstanding, this question isn't specific to dataframes, it's just about how to write a loop when the first element has to be treated differently to the rest.
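The "first element is special" pattern described in the answers is what `functools.reduce` does: it seeds the accumulator with the first frame and folds the rest into it. The original loop repeats 'dogecoin' because it seeds with that frame and then merges it again with itself. A sketch with made-up market-cap values (the frames and the name `frames` are illustrative; the `<coin>_cap` column naming follows the nxt sample in the question):

```python
from functools import reduce

import pandas as pd

coins = ["dashcoin", "litecoin", "dogecoin", "nxt"]

# Hypothetical data: three shared timestamps per coin.
stamps = pd.to_datetime(["2013-12-04", "2013-12-05", "2013-12-06"])
data2 = {
    coin: pd.DataFrame({"timestamp": stamps,
                        f"{coin}_cap": [n, n + 1, n + 2]})
    for n, coin in enumerate(coins)
}

# Fold the frames pairwise; the first frame seeds the accumulator,
# so no coin is merged twice.
frames = [data2[coin] for coin in coins]
data2["merged"] = reduce(
    lambda left, right: pd.merge(left, right, on="timestamp", how="inner"),
    frames,
)
print(data2["merged"])
```

Building `frames` from the `coins` list, rather than from `data2.values()`, keeps the new 'merged' entry out of later merges if the code is run again.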
