Python - duplicated lines [duplicate]

This question already has answers here:
How to analyze all duplicate entries in this Pandas DataFrame?
(3 answers)
Closed 7 years ago.
I am new to Python. I would like to find the duplicated lines in a data frame.
To illustrate, I have the following data frame:
type(data)
pandas.core.frame.DataFrame
data.head()
User Hour Min Day Month Year Latitude Longitude
0 0 1 48 17 10 2010 39.75000 -105.000000
1 0 6 2 16 10 2010 39.90625 -105.062500
2 0 3 48 16 10 2010 39.90625 -105.062500
3 0 18 25 14 10 2010 39.75000 -105.000000
I would like to find the duplicated lines in this data frame and to return the 'User' that corresponds to this line.
Thanks a lot,

Is this what you are looking for?
user = data[data.duplicated()]['User']
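Note that `duplicated()` leaves the first occurrence un-flagged by default; if you want the `User` for every row that belongs to a duplicate group (including the first copy), you can pass `keep=False`. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical frame with one exactly repeated row (User 2 appears twice)
data = pd.DataFrame({
    'User': [1, 2, 2, 3],
    'Latitude': [39.75, 39.90, 39.90, 39.75],
    'Longitude': [-105.00, -105.06, -105.06, -104.90],
})

# duplicated() flags only the second and later copies of a row...
repeats = data[data.duplicated()]['User']

# ...while keep=False flags every member of each duplicate group
all_dupes = data[data.duplicated(keep=False)]['User']
```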

Related

Select Value from largest index for each year [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 10 months ago.
I'm new to the Python world and working through a problem where I need to pull the value for the largest index for each year. I'll provide a table example and explain further.
Year  Index  D_Value
2010  13     85
2010  14     92
2010  15     76
2011  9      68
2011  10     73
2012  100    94
2012  101    89
So, the desired output would look like this:
Year  Index  D_Value
2010  15     76
2011  10     73
2012  101    89
I've tried researching how to apply max() and .loc, but I'm not sure what the optimal approach is for this scenario. Any help would be greatly appreciated. I've also included the below code to generate the test table.
import pandas as pd
data = {'Year':[2010,2010,2010,2011,2011,2012,2012],'Index':[13,14,15,9,10,100,101],'D_Value':[85,92,76,68,73,94,89]}
df = pd.DataFrame(data)
print(df)
You can use groupby + rank:
df['Rank'] = df.groupby(by='Year')['Index'].rank(ascending=False)
print(df[df['Rank'] == 1])
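Note that rank can produce ties. If `Index` is unique within each year, an equivalent tie-free approach is to take the row at each group's `idxmax` label; a sketch using the question's own test data:

```python
import pandas as pd

data = {'Year': [2010, 2010, 2010, 2011, 2011, 2012, 2012],
        'Index': [13, 14, 15, 9, 10, 100, 101],
        'D_Value': [85, 92, 76, 68, 73, 94, 89]}
df = pd.DataFrame(data)

# idxmax returns the row label of the max 'Index' per year;
# .loc then pulls those full rows in one step
result = df.loc[df.groupby('Year')['Index'].idxmax()]
```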

How to plot average of values for a year

I have a data frame like so. I am trying to make a plot with the mean of 'number' for each year on the y-axis and the year on the x-axis. I think what I need to do is build a new data frame with two columns, 'year' and 'avg number', one row per year. How would I go about doing that?
year number
0 2010 40
1 2010 44
2 2011 33
3 2011 32
4 2012 34
5 2012 56
When opening a question about pandas please make sure you follow these guidelines: How to make good reproducible pandas examples. It will help us reproduce your environment.
Assuming your dataframe is stored in the df variable:
df.groupby('year').mean().plot()
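If you want the intermediate 'year' / 'avg number' frame the question describes as an explicit step before plotting, a sketch:

```python
import pandas as pd

df = pd.DataFrame({'year': [2010, 2010, 2011, 2011, 2012, 2012],
                   'number': [40, 44, 33, 32, 34, 56]})

# Build the intermediate frame: one row per year with the mean of 'number'
avg = df.groupby('year', as_index=False)['number'].mean()
avg = avg.rename(columns={'number': 'avg number'})

# Then plot with year on the x-axis and the mean on the y-axis:
# avg.plot(x='year', y='avg number')
```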

Filtering by string giving me empty results [duplicate]

This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 2 years ago.
I am asking for any other algorithm or method that you would use to detect anomalies on a single column.
Filtering by a column is not showing the data.
I am using the following approach to limit my dataframe only to two columns
X = pd.read_csv('C:/Users/Path/file.csv', usecols=["Describe_File", "numbers"])
Describe_File numbers
0 This is the start 25
1 Ending is coming 42
2 Middle of the story 525
3 This is the start 65
4 This is the start 25
5 Middle of the story 35
6 This is the start 28
7 This is the start 24
8 Ending is coming 24
9 Ending is coming 35
10 Ending is coming 25
11 Ending is coming 24
12 This is the start 215
Now I want to go to the column **Describe_File**, filter by the string This is the start, and then show me the values of numbers.
To do so I usually use the following code, but for some reason it is not giving me anything, even though the string exists in my csv file:
X = X[X.Describe_File == "This is the start"]
You can use .str.contains(), a vectorized substring search, i.e.
df = X[X.Describe_File.str.contains("This is the start", regex=False)]
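An exact-match comparison can also come back empty because the CSV values carry stray leading or trailing whitespace. A sketch (with hypothetical data) that normalizes with `str.strip()` before comparing:

```python
import pandas as pd

# Hypothetical values with stray whitespace, mirroring the question's columns
X = pd.DataFrame({'Describe_File': ['This is the start ', ' Ending is coming',
                                    'This is the start'],
                  'numbers': [25, 42, 65]})

# Strip surrounding whitespace, then do the exact-match comparison
mask = X['Describe_File'].str.strip() == "This is the start"
numbers = X.loc[mask, 'numbers']
```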

Get values on a big DataFrame Python [duplicate]

This question already has answers here:
pandas groupby where you get the max of one column and the min of another column
(2 answers)
Closed 4 years ago.
I have a big dataframe with a structure like this:
ID Year Consumption
1 2012 24
2 2012 20
3 2012 21
1 2013 22
2 2013 23
3 2013 24
4 2013 25
I want another DataFrame that contains the first year of appearance and the max consumption of all time per ID, like this:
ID First_Year Max_Consumption
1 2012 24
2 2012 23
3 2012 24
4 2013 25
Is there a way to extract this data without using loops? I have tried this:
year = list(set(df.Year))
ids = list(set(df.ID))
antiq = list()
max_con = list()
for i in ids:
    df_id = df[df['ID'] == i]
    antiq.append(min(df_id['Year']))
    max_con.append(max(df_id['Consumption']))
But it's too slow. Thank you!
Use GroupBy + agg:
res = df.groupby('ID', as_index=False).agg({'Year': 'min', 'Consumption': 'max'})
print(res)
ID Year Consumption
0 1 2012 24
1 2 2012 23
2 3 2012 24
3 4 2013 25
Another alternative to groupby is pivot_table:
pd.pivot_table(df, index="ID", aggfunc={"Year": min, "Consumption": max})
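Both approaches keep the original column names. To get the `First_Year` / `Max_Consumption` headers from the desired output directly, named aggregation (available since pandas 0.25) does the renaming in the same step; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 1, 2, 3, 4],
                   'Year': [2012, 2012, 2012, 2013, 2013, 2013, 2013],
                   'Consumption': [24, 20, 21, 22, 23, 24, 25]})

# Named aggregation: output column = (input column, aggregation function)
res = df.groupby('ID', as_index=False).agg(
    First_Year=('Year', 'min'),
    Max_Consumption=('Consumption', 'max'),
)
```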

Matching 'Date' dataframes in Pandas to enable joins/merging

I have two csv files loaded into pandas dataframes, each with a 'Date' column that I want to use to join the two tables (my goal is to join the two csvs by date and merge matching rows by summing them).
The issue is that despite sharing the same month-year format, my first csv abbreviated the years, whereas my desired output would be mm-yyyy (for example, Aug-2012 as opposed to Aug-12).
csv1:
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
...
has 41 rows; i.e. 41 months' worth of data between Oct. 12 - Feb. 16
csv2:
0 Jan-2009 943690
1 Feb-2009 1062565
2 Mar-2009 210079
3 Apr-2009 -735286
4 May-2009 842933
5 Jun-2009 358691
6 Jul-2009 914953
7 Aug-2009 723427
8 Sep-2009 -837468
...
has 86 rows; i.e. 86 months' worth of data between Jan. 2009 - Feb. 2016
I tried initially to do something akin to a 'find and replace' function as one would in Excel.
I tried :
findlist = ['12','13','14','15','16']
replacelist = ['2012','2013','2014','2015','2016']
def findReplace(find, replace):
    s = csv1_df.read()
    s = s.replace(Date, replacement)
    csv1_dfc.write(s)

for item, replacement in zip(findlist, replacelist):
    s = s.replace(Date, replacement)
But I am getting a
NameError: name 's' is not defined
You can use to_datetime to transform to datetime format, and then strftime to adjust your format:
df['col_date'] = pd.to_datetime(df['col_date'], format="%b-%y").dt.strftime('%b-%Y')
Input:
col_date val
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
Output:
col_date val
0 Oct-2012 1154293
1 Nov-2012 885773
2 Dec-2012 -448704
3 Jan-2013 563679
4 Feb-2013 555394
5 Mar-2013 631974
6 Apr-2013 957395
7 May-2013 1104047
8 Jun-2013 693464
Note the lower-case y for a 2-digit year and the upper-case Y for a 4-digit year.
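Since the end goal is joining the two tables by date, another option is to parse both columns into real datetimes (each with its own format) and merge on those instead of matching strings. A sketch, with column names and values assumed for illustration:

```python
import pandas as pd

# Hypothetical slices of the two csvs, each with its own date format
csv1 = pd.DataFrame({'Date': ['Oct-12', 'Nov-12'], 'val': [1154293, 885773]})
csv2 = pd.DataFrame({'Date': ['Oct-2012', 'Nov-2012'], 'val': [100, 200]})

# Parse each file's own format (%y = 2-digit year, %Y = 4-digit year)
csv1['Date'] = pd.to_datetime(csv1['Date'], format='%b-%y')
csv2['Date'] = pd.to_datetime(csv2['Date'], format='%b-%Y')

# Merge on the parsed dates, then sum the matching value columns
merged = csv1.merge(csv2, on='Date', suffixes=('_1', '_2'))
merged['total'] = merged['val_1'] + merged['val_2']
```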
