The problem is that I am trying to make a ranking for every 3 cells in that column using pandas.
For example:
This is the outcome I want, but I have no idea how to produce it.
I tried something like this:
for i in range(df.iloc[1:], df.iloc[,:], 3):
    counter = 0
    i['item'] += counter + 1
The code is completely wrong, but I need help with the range and with how to use df.iloc inside the brackets in pandas.
Does this match the requirements?
import pandas as pd

df = pd.DataFrame()
df['Item'] = ['shoes', 'shoes', 'shoes', 'shirts', 'shirts', 'shirts']

# build a lookup table assigning one rank per unique item
df2 = pd.DataFrame()
for i, item in enumerate(df['Item'].unique(), 1):
    df2.loc[i - 1, 'rank'] = i
    df2.loc[i - 1, 'Item'] = item
df2['rank'] = df2['rank'].astype('int')

print(df)
print("\n")
print(df2)

# merge the rank back onto the original frame
df = df.merge(df2, on='Item', how='inner')
print("\n")
print(df)
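If the goal is literally a rank per block of 3 consecutive rows (rather than per unique item), a shorter sketch, assuming the blocks are simply consecutive groups of three, could be:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Item': ['shoes', 'shoes', 'shoes', 'shirts', 'shirts', 'shirts']})
# integer-divide the row position by 3 so rows 0-2 get rank 1, rows 3-5 get rank 2, ...
df['rank'] = np.arange(len(df)) // 3 + 1
print(df)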
I've been struggling with the following issue that sounds very easy, but I can't seem to figure it out; I'm sure it's something very obvious in the stack trace and I'm just being dumb.
I simply have a pandas dataframe looking like this:
I want to drop the rows whose jpgs cell value (a list) contains the value "123.jpg", so I should end up with a final dataframe containing only the rows at index 1 and 3. However, I've tried a lot of methods and none of them works.
For example:
df = df["123.jpg" not in df.jpgs]
or
df = df[df.jpgs.tolist().count("123.jpg") == 0]
give the error KeyError: True.
df = df[df['jpgs'].str.contains('123.jpg') == False]
returns an empty dataframe.
df = df[df.jpgs.count("123.jpg") == 0]
and
df = df.drop(df["123.jpg" in df.jpgs].index)
give KeyError: False.
Here is my entire code if needed. I would really appreciate it if someone could help me see what I'm doing wrong :( Thanks!!
import pandas as pd
df = pd.DataFrame(columns=["person_id", "jpgs"])
id = 1
pair1 = ["123.jpg", "124.jpg"]
pair2 = ["125.jpg", "300.jpg"]
pair3 = ["500.jpg", "123.jpg"]
pair4 = ["111.jpg", "122.jpg"]
row1 = {'person_id': id, 'jpgs': pair1}
row2 = {'person_id': id, 'jpgs': pair2}
row3 = {'person_id': id, 'jpgs': pair3}
row4 = {'person_id': id, 'jpgs': pair4}
df = df.append(row1, ignore_index=True)
df = df.append(row2, ignore_index=True)
df = df.append(row3, ignore_index=True)
df = df.append(row4, ignore_index=True)
print(df)
#df = df["123.jpg" not in df.jpgs]
#df = df[df['jpgs'].str.contains('123.jpg') == False]
#df = df[df.jpgs.tolist().count("123.jpg") == 0]
df = df.drop(df["123.jpg" in df.jpgs].index)
print("\n Final df")
print(df)
Since you are filtering on a list column, apply with a lambda is probably the easiest:
df.loc[df.jpgs.apply(lambda x: "123.jpg" not in x)]
Quick comments on your attempts:
In df = df.drop(df["123.jpg" in df.jpgs].index) you are checking whether the exact value "123.jpg" occurs in the Series itself ("123.jpg" in df.jpgs tests the index labels, not the values) rather than in any of the lists, which is not what you want; see the sketch after these notes.
df = df[df['jpgs'].str.contains('123.jpg') == False] goes in the right direction, but you are missing the regex=False keyword, as shown in Ibrahim's answer.
df[df.jpgs.count("123.jpg") == 0] is also not applicable here, since count returns the total number of non-NaN values in the Series.
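To make the first pitfall concrete, here is a minimal sketch (with made-up data) of what the in operator actually tests on a Series:

import pandas as pd

s = pd.Series([["a.jpg"], ["123.jpg"]])
print("123.jpg" in s)                     # False: `in` tests the index labels (0 and 1)
print(s.apply(lambda x: "123.jpg" in x))  # tests membership inside each list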
For the str.contains one, this is how it is done (negated with ~ so the matching rows are dropped):
df = df[~df.jpgs.str.contains("123.jpg", regex=False)]
You can try this:
mask = df.jpgs.apply(lambda x: '123.jpg' not in x)
df = df[mask]
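On pandas 0.25 or newer, another option (a sketch along the same lines, not taken from the answers above) is to explode the list column and aggregate the test back per row:

# explode repeats the original index per list element, so the test can be grouped back
contains = df.explode('jpgs')['jpgs'].eq('123.jpg').groupby(level=0).any()
df = df[~contains]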
I have a CSV file and I am trying to split each row into multiple rows if it contains more than 4 columns.
Example:
[input table image]
Expected Output:
[expected output image]
Is there a way to do that in pandas or Python? Sorry if this is a simple question.
When there are two columns with the same name in a CSV file, pandas automatically appends an integer suffix to the duplicate column names.
For example, this CSV file:
will become this:
df = pd.read_csv("Book1.csv")
df
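You can see this mangling quickly with made-up data instead of Book1.csv:

import io
import pandas as pd

csv_text = "id,x,y,x,y\n1_1,31.5,22.6,31.4,22.5\n"
print(pd.read_csv(io.StringIO(csv_text)).columns.tolist())
# ['id', 'x', 'y', 'x.1', 'y.1'] -- duplicate names get a numeric suffix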
Now, to solve your question, let's consider the above dataframe as the input. Try this:
cols = df.columns.tolist()
cols.remove('id')
start = 0
end = 4
new_df = []
final_cols = ['id', 'x1', 'y1', 'x2', 'y2']
# walk over the columns in chunks of 4, building one sub-frame per chunk
while start < len(cols):
    if end > len(cols):
        end = len(cols)
    temp = cols[start:end]
    start = end
    end = end + 4
    temp_df = df.loc[:, ['id'] + temp]
    temp_df.columns = final_cols[:1 + len(temp)]
    # pad a short final chunk with empty columns
    if len(temp) < 4:
        temp_df[final_cols[1 + len(temp):]] = None
    print(temp_df)
    new_df.append(temp_df)
pd.concat(new_df).reset_index(drop=True)
Result:
You can first set the video column as the index, then concatenate every 4 remaining columns into a new dataframe. Finally, reset the index to get the video column back.
df.set_index('video', inplace=True)
dfs = []
for i in range(len(df.columns)//4):
    d = df.iloc[:, range(i*4, i*4+4)]
    dfs.append(d.set_axis(['x_center', 'y_center']*2, axis=1))
df_ = pd.concat(dfs).reset_index()
I think the following list comprehension should work, but it gives a positional indexing error on my machine and I don't know why:
df_ = pd.concat([df.iloc[: range(i*4, i*4+4)].set_axis(['x_center', 'y_center']*2, axis=1) for i in range(len(df.columns)//4)])
print(df_)
video x_center y_center x_center y_center
0 1_1 31.510973 22.610222 31.383655 22.488293
1 1_1 31.856295 22.830109 32.016905 22.948702
2 1_1 32.011684 22.990689 31.933356 23.004779
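If the layout is guaranteed to be a video column followed by consecutive (x, y) pairs, a numpy reshape sketch (column names assumed from the answers above, untested against the real file) is another option; note that it groups the output rows per video rather than per pair position:

import numpy as np
import pandas as pd

# assumes df has a 'video' column followed by 2*k numeric columns in (x, y) order
pairs = df.drop(columns='video').to_numpy().reshape(len(df), -1, 2)
k = pairs.shape[1]
out = pd.DataFrame(pairs.reshape(-1, 2), columns=['x_center', 'y_center'])
out.insert(0, 'video', np.repeat(df['video'].to_numpy(), k))  # one video label per pair
print(out)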
I found this: Pandas Dataframe - select columns with a specific value in a specific row. But I couldn't figure out how to do it iteratively from user input.
import csv
import pandas as pd

no_to_search = input()
with open('jen.csv', 'rt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        if no_to_search == row:
            df = pd.read_csv('jen.csv')
            for col in df.columns:
                if (df[col] == 1).any():
                    print(col)
I want to retrieve the column names whose value is 1 in the row matching the input.
If I have understood your question correctly, you want to retrieve the columns where the value of a certain row matches your condition (value = 1).
Example code:
>> import pandas as pd
>> df = pd.DataFrame([[1,0,1],[0,1,0],[1,1,0]], columns=["car_db","bike_db","van_db"], index=["7602102000", "7602201132", "7622315645"])
>> df
car_db bike_db van_db
7602102000 1 0 1
7602201132 0 1 0
7622315645 1 1 0
>> df.to_csv("tmp.csv")
The example data is in a CSV file called "tmp.csv" and is exactly like your example.
You want the names of the columns where the phone number is 7602102000 and the value is 1:
>> df = pd.read_csv("tmp.csv", index_col=0)
>> no = 7602102000 # phone number entered by your users
>> col_names = df.columns[(df.loc[df.index == no].values==1)[0]]
>> col_names
Index(['car_db', 'van_db'], dtype='object')
>> df[col_names]
car_db van_db
7602102000 1 1
7602201132 0 0
7622315645 1 0
This will print all the columns where the condition is met.
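One practical wrinkle with user input: input() returns a string, while the index read back from the CSV here holds integers, so a cast is needed (a sketch, assuming the same tmp.csv):

no = int(input())  # input() gives a string; the index above holds integers
col_names = df.columns[(df.loc[df.index == no].values == 1)[0]]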
Here's a much more efficient way of doing this. I changed the code to suit your question by adding another condition to the lambda.
df.apply(lambda x: print(x.index[(x['phone no'] == number) & (x.isin([1]))].values), axis=1)
If you want to add a column to your data frame:
df['cols'] = df.apply(lambda x: x.index[(x['phone no'] == number) & (x.isin([1]))].values, axis=1)
I am sure there's a more efficient way of doing this, but this should work. Let me know if it doesn't.
for i in range(len(df)):
    cols = df.columns[(df == 1).iloc[i]]  # columns equal to 1 in row i
    if len(cols) > 0:
        print(cols.values)
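If you would rather collect the matching column names for every row at once instead of printing inside a loop, a small sketch:

# for each row, list the names of the columns whose value is 1
hits = (df == 1).apply(lambda row: list(df.columns[row]), axis=1)
print(hits)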
Every time I create a loop, something goes wrong with the first element. For example:
dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfe]
dps = []
for i in df:
I just get the second dataframe's values.
Using this:
dfd = quandl.get("FRED/DEXBZUS")
df = [dfd]
dps = []
for i in df:
I got this:
Empty DataFrame
Columns: []
Index: []
And if I use this (repeating the first one):
dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfd, dfe]
dps = []
for i in df:
I get both dataframes correctly.
Example:
import quandl
import pandas as pd
#import matplotlib
import matplotlib.pyplot as plt

dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfe]
dps = []
for i in df:
    df1 = i.reset_index()
    results = pd.DataFrame(df1)
    results = results.rename(columns={'Date': 'ds', 'Value': 'y'})
    dps = pd.DataFrame(dps.append(results))
print(dps)
Empty DataFrame
Columns: []
Index: []
ds y
0 2008-01-02 2.6010
1 2008-01-03 2.5979
2 2008-01-04 2.5709
3 2008-01-07 2.6027
4 2008-01-08 2.5796
UPDATE
As Bruno suggested, it is related to this function:
dps = pd.DataFrame(dps.append(results))
How can I append all of the datasets into one dataframe?
If you create a dataframe with result = pd.DataFrame(df1) and don't pass columns, then by default it will take the first row as the column names, and later you are renaming the columns that were created by default. So create it as pd.DataFrame(df1, columns=[column_list]) instead; then the first row will not be skipped.
# this will print every element in df
for i in df:
    print(i)
Also,
for dfIndex, i in enumerate(df):
    print(i)
    print(dfIndex)  # this will print the index of i in df
Note that indexes start at 0, not 1.
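For what it's worth, the symptom in the question matches dps starting out as a plain Python list: list.append returns None, so dps = pd.DataFrame(dps.append(results)) builds an empty frame on the first pass and silently drops the first dataset; from the second pass on, dps is a DataFrame whose append returns the combined frame. A sketch of the usual pattern, collecting frames in a list and concatenating once at the end:

frames = []
for i in df:
    df1 = i.reset_index().rename(columns={'Date': 'ds', 'Value': 'y'})
    frames.append(df1)
dps = pd.concat(frames, ignore_index=True)
print(dps)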
I have a pandas dataframe with two columns: 'action_date', holding a single date, and 'verification_date', holding a list of dates. I am trying to calculate the time difference between the date in 'action_date' and each of the dates in the list in the corresponding 'verification_date' column, and then fill two new df columns with the number of dates in verification_date that have a difference of either over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.TimeGrouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        # day differences between the action date and each verification date
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df

make_columns(df)
This kinda works, EXCEPT that the df ends up with the same values in every row, which is wrong because the dates differ. For example, in the first row of the dataframe there IS a difference of over 360 days between the action_date and both of the items in the list in the verification_date column, so the over_360 column should be populated with 2. Instead it is empty, and the under_360 column is populated with 1, which is accurate only for the second row of 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want instead is to set the value for each row's calculation individually; you can do this by replacing the lines above with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
What it does is set a value at row i in column over_360 or under_360. You can learn more about it here.
If you don't like using set_value, you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
You can read about dataframe.ix here.
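Note that both set_value and .ix have since been deprecated and removed from pandas; on modern versions the equivalent per-cell assignment is .at (or .loc):

df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)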
You might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.
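A side note on the question's setup: pd.TimeGrouper has also been deprecated and removed; on current pandas the grouping line would use pd.Grouper instead:

df = df.groupby(pd.Grouper(freq='2D'))['verification_date'].apply(list).reset_index()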