I have a first dataframe containing a series called 'Date' and a variable number n of series called 'People_1' to 'People_n':
Id Date People_1 People_2 People_3 People_4 People_5 People_6 People_7
12.0 Sat Dec 19 00:00:00 EST 1970 Loretta Lynn Owen Bradley
13.0 Sat Jun 07 00:00:00 EDT 1980 Sissy Spacek Loretta Lynn Owen Bradley
14.0 Sat Dec 04 00:00:00 EST 2010 Loretta Lynn Sheryl Crow Miranda Lambert
15.0 Sat Aug 09 00:00:00 EDT 1969 Charley Pride Dallas Frazier A.L. "Doodle" Chet Atkins Jack Clement Bob Ferguson Felton Jarvis
I also have another dataframe containing a list of names and biographic data:
People Birth_date Birth_state Sex Ethnicity
Charles Kelley Fri Sep 11 00:00:00 EDT 1981 GA Male Caucasian
Hillary Scott Tue Apr 01 00:00:00 EST 1986 TN Female Caucasian
Reba McEntire Mon Mar 28 00:00:00 EST 1955 OK Female Caucasian
Wanda Jackson Wed Oct 20 00:00:00 EST 1937 OK Female Caucasian
Carrie Underwood Thu Mar 10 00:00:00 EST 1983 OK Female Caucasian
Toby Keith Sat Jul 08 00:00:00 EDT 1961 OK Male Caucasian
David Bellamy Sat Sep 16 00:00:00 EDT 1950 FL Male Caucasian
Howard Bellamy Sat Feb 02 00:00:00 EST 1946 FL Male Caucasian
Keith Urban Thu Oct 26 00:00:00 EDT 1967 Northland Male Caucasian
Miranda Lambert Thu Nov 10 00:00:00 EST 1983 TX Female Caucasian
Sam Hunt Sat Dec 08 00:00:00 EST 1984 GA Male Caucasian
Johnny Cash Fri Feb 26 00:00:00 EST 1932 AR Male Caucasian
June Carter Sun Jun 23 00:00:00 EDT 1929 VA Female Caucasian
Merle Haggard Tue Apr 06 00:00:00 EST 1937 CA Male Caucasian
Waylon Jennings Tue Jun 15 00:00:00 EDT 1937 TX Male Caucasian
Willie Nelson Sat Apr 29 00:00:00 EST 1933 TX Male Caucasian
Loretta Lynn Thu Apr 14 00:00:00 EST 1932 KY Female Caucasian
Sissy Spacek Sun Dec 25 00:00:00 EST 1949 TX Female Caucasian
Sheryl Crow Sun Feb 11 00:00:00 EST 1962 MO Female Caucasian
Charley Pride Sun Mar 18 00:00:00 EST 1934 MS Male African American
Rodney Clawon ? TX Male Caucasian
Nathan Chapman ? TN Male Caucasian
I want to get, for each date, the biographic data of each person involved that day:
Date Birth_state Sex Ethnicity
Sat Dec 19 00:00:00 EST 1970 KY Female Caucasian
Sat Jun 07 00:00:00 EDT 1980 TX Female Caucasian
Sat Jun 07 00:00:00 EDT 1980 KY Female Caucasian
Sat Dec 04 00:00:00 EST 2010 KY Female Caucasian
Sat Dec 04 00:00:00 EST 2010 MO Female Caucasian
Sat Dec 04 00:00:00 EST 2010 TX Female Caucasian
Sat Aug 09 00:00:00 EDT 1969 MS Male African American
Clarification:
Note that my bio data isn't complete yet; some names are missing, which explains why I don't have a row for every person.
So, is there a way to perform this task in Python?
Lionel
You can use a left join in pandas to join the two tables and then select the columns you need.
For example, first reshape all the 'People_1' to 'People_n' columns into a single column, say 'People', so that there is one row per (date, person) pair. Once that is done, join the two dataframes on that column.
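A minimal sketch of that reshape-then-join, on made-up miniature versions of the two dataframes (the names events and bios are assumptions). Since the desired output only shows people whose bio exists, how='inner' is used here; how='left' would keep unmatched people with NaN bio columns.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
events = pd.DataFrame({
    'Date': ['Sat Dec 19 00:00:00 EST 1970', 'Sat Jun 07 00:00:00 EDT 1980'],
    'People_1': ['Loretta Lynn', 'Sissy Spacek'],
    'People_2': [None, 'Loretta Lynn'],
})
bios = pd.DataFrame({
    'People': ['Loretta Lynn', 'Sissy Spacek'],
    'Birth_state': ['KY', 'TX'],
    'Sex': ['Female', 'Female'],
    'Ethnicity': ['Caucasian', 'Caucasian'],
})

# Melt the People_* columns into one long column: one row per (Date, person)
people_cols = [c for c in events.columns if c.startswith('People_')]
long = (events.melt(id_vars='Date', value_vars=people_cols, value_name='People')
              .dropna(subset=['People']))

# Join on the person's name; keep only the requested columns
result = long.merge(bios, on='People', how='inner')[
    ['Date', 'Birth_state', 'Sex', 'Ethnicity']]
print(result)
```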
Related
I have a large panel data in a pandas DataFrame:
import pandas as pd
df = pd.read_csv('Qs_example_data.csv')
df.head()
ID Year DOB status YOD
223725 1991 1975.0 No 2021
223725 1992 1975.0 No 2021
223725 1993 1975.0 No 2021
223725 1994 1975.0 No 2021
223725 1995 1975.0 No 2021
I want to drop the rows based on the following condition:
If the value in YOD matches the value in Year, then all rows after that matching row for that ID are dropped; the same applies if a 'Yes' is observed in the status column for that ID.
For example, in the DataFrame, ID 68084329 has the matching value 2012 in the Year and YOD columns on row 221930. All rows after 221930 for 68084329 should be dropped.
df.loc[df['ID'] == 68084329]
ID Year DOB status YOD
221910 68084329 1991 1942.0 No 2012
221911 68084329 1992 1942.0 No 2012
221912 68084329 1993 1942.0 No 2012
221913 68084329 1994 1942.0 No 2012
221914 68084329 1995 1942.0 No 2012
221915 68084329 1996 1942.0 No 2012
221916 68084329 1997 1942.0 No 2012
221917 68084329 1998 1942.0 No 2012
221918 68084329 1999 1942.0 No 2012
221919 68084329 2000 1942.0 No 2012
221920 68084329 2001 1942.0 No 2012
221921 68084329 2002 1942.0 No 2012
221922 68084329 2003 1942.0 No 2012
221923 68084329 2004 1942.0 No 2012
221924 68084329 2005 1942.0 No 2012
221925 68084329 2006 1942.0 No 2012
221926 68084329 2007 1942.0 No 2012
221927 68084329 2008 1942.0 No 2012
221928 68084329 2010 1942.0 No 2012
221929 68084329 2011 1942.0 No 2012
221930 68084329 2012 1942.0 Yes 2012
221931 68084329 2013 1942.0 No 2012
221932 68084329 2014 1942.0 No 2012
221933 68084329 2015 1942.0 No 2012
221934 68084329 2016 1942.0 No 2012
221935 68084329 2017 1942.0 No 2012
I have a lot of IDs that have rows which need to be dropped in accordance with the above condition. How do I do this?
The following code should also work:
result = df[0:0]            # empty frame with the same columns as df
ids = []
for i in df.ID:             # collect the unique IDs, preserving order
    if i not in ids:
        ids.append(i)
for k in ids:
    temp = df[df.ID == k]
    for j in range(len(temp)):
        result = pd.concat([result, temp.iloc[j:j+1, :]])
        if temp.iloc[j, :]['status'] == 'Yes':
            break           # stop copying rows for this ID after the first 'Yes'
print(result)
This should do it. From your wording, it wasn't clear whether you need to "drop all the rows after you encounter a Yes for that ID" or "just the rows where you encounter a Yes". I assumed the former.
import pandas as pd
def __get_nos__(df):
    # Keep rows up to (but not including) the first 'Yes' in Status.
    # Note: argmin returns 0 when a group has no 'Yes', so such a
    # group would come back empty.
    return df.iloc[0:(df['Status'] != 'Yes').values.argmin(), :]
df = pd.DataFrame()
df['ID'] = [12345678]*10 + [13579]*10
df['Year'] = list(range(2000, 2010))*2
df['DOB'] = list(range(2000, 2010))*2
df['YOD'] = list(range(2000, 2010))*2
df['Status'] = ['No']*5 + ['Yes']*5 + ['No']*7 + ['Yes']*3
""" df
ID Year DOB YOD Status
0 12345678 2000 2000 2000 No
1 12345678 2001 2001 2001 No
2 12345678 2002 2002 2002 No
3 12345678 2003 2003 2003 No
4 12345678 2004 2004 2004 No
5 12345678 2005 2005 2005 Yes
6 12345678 2006 2006 2006 Yes
7 12345678 2007 2007 2007 Yes
8 12345678 2008 2008 2008 Yes
9 12345678 2009 2009 2009 Yes
10 13579 2000 2000 2000 No
11 13579 2001 2001 2001 No
12 13579 2002 2002 2002 No
13 13579 2003 2003 2003 No
14 13579 2004 2004 2004 No
15 13579 2005 2005 2005 No
16 13579 2006 2006 2006 No
17 13579 2007 2007 2007 Yes
18 13579 2008 2008 2008 Yes
19 13579 2009 2009 2009 Yes
"""
df.groupby('ID').apply(lambda x: __get_nos__(x)).reset_index(drop=True)
""" Output
ID Year DOB YOD Status
0 13579 2000 2000 2000 No
1 13579 2001 2001 2001 No
2 13579 2002 2002 2002 No
3 13579 2003 2003 2003 No
4 13579 2004 2004 2004 No
5 13579 2005 2005 2005 No
6 13579 2006 2006 2006 No
7 12345678 2000 2000 2000 No
8 12345678 2001 2001 2001 No
9 12345678 2002 2002 2002 No
10 12345678 2003 2003 2003 No
11 12345678 2004 2004 2004 No
"""
My question is as follows: I have a data set of ~700 MB which looks like
rpt_period_name_week period_name_mth assigned_date_utc resolved_date_utc handle_seconds action marketplace_id login category currency_code order_amount_in_usd day_of_week_NewClmn
2020 Week 01 2020 / 01 1/11/2020 23:58 1/11/2020 23:59 84 Pass DE a MRI AT EUR 81.32 Saturday
2020 Week 02 2020 / 01 1/11/2020 23:58 1/11/2020 23:59 37 Pass DE b MRI AQ EUR 222.38 Saturday
2020 Week 01 2020 / 01 1/11/2020 23:57 1/11/2020 23:59 123 Pass DE a MRI DG EUR 444.77 Saturday
2020 Week 02 2020 / 01 1/11/2020 23:54 1/11/2020 23:59 313 Hold JP a MRI AQ Saturday
2020 Week 01 2020 / 01 1/11/2020 23:57 1/11/2020 23:59 112 Pass FR b MRI DG EUR 582.53 Saturday
2020 Week 02 2020 / 01 1/11/2020 23:54 1/11/2020 23:58 249 Pass DE f MRI AT EUR 443.16 Saturday
2020 Week 03 2020 / 01 1/11/2020 23:58 1/11/2020 23:58 48 Pass DE b MRI DG EUR 20.5 Saturday
2020 Week 03 2020 / 01 1/11/2020 23:57 1/11/2020 23:58 40 Pass IT a MRI AQ EUR 272.01 Saturday
My desired output is like this:
[Output][1]
https://i.stack.imgur.com/8oz7G.png
My code is below, but I am unable to get the desired result: my cells are getting divided by the sum of the whole row. I have tried multiple options, but in vain.
df = data_final.groupby(['login','category','rpt_period_name_week','action'])['action'].agg(np.count_nonzero).unstack(['rpt_period_name_week','action']).apply(lambda x: x.fillna(0))
df = df.div(df.sum(1), 0).mul(100).round(2).assign(Total=lambda df: df.sum(axis=1))
# df = df.div(df.sum(1), 0).mul(100).round(2).assign(Total=lambda df: df.sum(axis=1))
df1 = df.astype(str) + '%'
# print (df1)
Please help.
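Without the linked image it is hard to be certain of the exact layout, but if the goal is the percentage of each action within each week (rather than across the whole row), one sketch is to build the counts with pd.crosstab and divide by per-week totals. The miniature data below is made up; the column names are taken from the question:

```python
import pandas as pd

# Made-up miniature of the relevant columns
data_final = pd.DataFrame({
    'login': ['a', 'a', 'b', 'b', 'a'],
    'category': ['MRI'] * 5,
    'rpt_period_name_week': ['2020 Week 01'] * 3 + ['2020 Week 02'] * 2,
    'action': ['Pass', 'Hold', 'Pass', 'Pass', 'Pass'],
})

# Counts per (login, category) row, with (week, action) columns
counts = pd.crosstab(
    index=[data_final['login'], data_final['category']],
    columns=[data_final['rpt_period_name_week'], data_final['action']],
)

# Divide each cell by the total of its week (column level 0), not the row total
pct = counts.div(counts.T.groupby(level=0).transform('sum').T).mul(100).round(2)
print(pct)
```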
I need to read from a text file, then print the information separately.
For example:
I'm given a list of names in this format: Orville Wright 21 July 1988
And I need to make the outcome look like this:
Name
1. Orville Wright
Date
1. 21 July 1988
I've tried using a reader to separate them, but I would have to have a separate code line for every name and date given, as they are not all the same length.
with open('File name and location', 'r') as reader:
print(reader.readline(14))
This is the outcome: Orville Wright
I want my results to be:
Name:
1. Orville Wright
2. Rogelio Holloway
etc
Date:
1. 21 July 1988
2. 13 September 1988
etc
The contents of the file are as follows:
Orville Wright 21 July 1988
Rogelio Holloway 13 September 1988
Marjorie Figueroa 9 October 1988
Debra Garner 7 February 1988
Tiffany Peters 25 July 1988
Hugh Foster 2 June 1988
Darren Christensen 21 January 1988
Shelia Harrison 28 July 1988
Ignacio James 12 September 1988
Jerry Keller 30 February 1988
Frankie Cobb 1 July 1988
Clayton Thomas 10 December 1988
Laura Reyes 9 November 1988
Danny Jensen 19 September 1988
Sabrina Garcia 20 October 1988
Winifred Wood 27 July 1988
Juan Kennedy 4 March 1988
Nina Beck 7 May 1988
Tanya Marshall 22 May 1988
Kelly Gardner 16 August 1988
Cristina Ortega 13 January 1988
Guy Carr 21 June 1988
Geneva Martinez 5 September 1988
Ricardo Howell 23 December 1988
Bernadette Rios 19 July 1988
This is one approach using regex.
Ex:
import re

names = []
dates = []

with open(filename) as infile:
    for line in infile:
        line = line.strip()
        # Extract the date: 1-2 digits, a month word, then a 4-digit year
        date = re.search(r"(\d{1,2} [a-zA-Z]+ \d{4})", line).group(1)
        dates.append(date)
        names.append(line.replace(date, "").strip())  # what's left is the name

print("Name:")
for name in names:
    print(name)
print("---" * 10)
print("Date:")
for date in dates:
    print(date)
Output:
Name:
Orville Wright
Rogelio Holloway
Marjorie Figueroa
Debra Garner
Tiffany Peters
Hugh Foster
Darren Christensen
Shelia Harrison
Ignacio James
Jerry Keller
Frankie Cobb
Clayton Thomas
Laura Reyes
Danny Jensen
Sabrina Garcia
Winifred Wood
Juan Kennedy
Nina Beck
Tanya Marshall
Kelly Gardner
Cristina Ortega
Guy Carr
Geneva Martinez
Ricardo Howell
Bernadette Rios
------------------------------
Date:
21 July 1988
13 September 1988
9 October 1988
7 February 1988
25 July 1988
2 June 1988
21 January 1988
28 July 1988
12 September 1988
30 February 1988
1 July 1988
10 December 1988
9 November 1988
19 September 1988
20 October 1988
27 July 1988
4 March 1988
7 May 1988
22 May 1988
16 August 1988
13 January 1988
21 June 1988
5 September 1988
23 December 1988
19 July 1988
Store all the names and dates inside different lists, then display each one.
The following code assumes that each name-and-date pair sits on its own line, and that the first digit in a line marks the start of the date.
import re

names = []
dates = []

with open('File name and location', 'r') as reader:
    for line in reader.readlines():
        date_position = re.search(r"\d", line).start()  # index of the first digit
        names.append(line[:date_position - 1])          # everything before the date
        dates.append(line[date_position:].rstrip())     # strip the trailing newline
Now you can print each name and date to your liking:
for i, name in enumerate(names):
    print(f"{i+1}. {name}")
And for dates:
for i, date in enumerate(dates):
    print(f"{i+1}. {date}")
Output (for part of the text file):
1. Orville Wright
2. Rogelio Holloway
3. Marjorie Figueroa
4. Debra Garner
5. Tiffany Peters
6. Hugh Foster
7. Darren Christensen
8. Shelia Harrison
9. Ignacio James
10. Jerry Keller
11. Frankie Cobb
12. Clayton Thomas
13. Laura Reyes
14. Danny Jensen
15. Sabrina Garcia
16. Winifred Wood
17. Juan Kennedy
1. 21 July 1988
2. 13 September 1988
3. 9 October 1988
4. 7 February 1988
5. 25 July 1988
6. 2 June 1988
7. 21 January 1988
8. 28 July 1988
9. 12 September 1988
10. 30 February 1988
11. 1 July 1988
12. 10 December 1988
13. 9 November 1988
14. 19 September 1988
15. 20 October 1988
16. 27 July 1988
17. 4 March 1988
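A regex-free alternative (a sketch on two sample lines): since the date is always the last three whitespace-separated tokens, str.rsplit with maxsplit=3 splits each line cleanly even for multi-word names:

```python
names = []
dates = []

# Each line looks like "Firstname Lastname 21 July 1988"
lines = [
    "Orville Wright 21 July 1988",
    "Rogelio Holloway 13 September 1988",
]
for line in lines:
    # Split off the last three tokens (day, month, year); the rest is the name
    name, day, month, year = line.rsplit(None, 3)
    names.append(name)
    dates.append(f"{day} {month} {year}")
print(names)
print(dates)
```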
I have a data frame that has 3 columns.
Time represents every day of the month, for various months. What I am trying to do is take the 'Count' value per day and average it over each month, for each country. The output must be in the form of a data frame.
Current data:
Time Country Count
2017-01-01 us 7827
2017-01-02 us 7748
2017-01-03 us 7653
..
..
2017-01-30 us 5432
2017-01-31 us 2942
2017-01-01 ca 5829
2017-01-02 ca 9843
2017-01-03 ca 7845
..
..
2017-01-30 ca 8654
2017-01-31 ca 8534
Desired output (dummy data; the numbers are not representative of the DF above):
Time Country Monthly Average
Jan 2017 us 6873
Feb 2017 us 8875
..
..
Nov 2017 us 9614
Dec 2017 us 2475
Jan 2017 ca 1878
Feb 2017 ca 4775
..
..
Nov 2017 ca 7643
Dec 2017 ca 9441
I'd organize it like this:
df.groupby(
    [df.Time.dt.strftime('%b %Y'), 'Country']
)['Count'].mean().reset_index(name='Monthly Average')
Time Country Monthly Average
0 Feb 2017 ca 88.0
1 Feb 2017 us 105.0
2 Jan 2017 ca 85.0
3 Jan 2017 us 24.6
4 Mar 2017 ca 86.0
5 Mar 2017 us 54.0
If your 'Time' column wasn't already a datetime column, I'd do this:
df.groupby(
    [pd.to_datetime(df.Time).dt.strftime('%b %Y'), 'Country']
)['Count'].mean().reset_index(name='Monthly Average')
Time Country Monthly Average
0 Feb 2017 ca 88.0
1 Feb 2017 us 105.0
2 Jan 2017 ca 85.0
3 Jan 2017 us 24.6
4 Mar 2017 ca 86.0
5 Mar 2017 us 54.0
Use pandas dt.strftime to create the month-year column you desire, then groupby + mean. I used this dataframe:
Dated country num
2017-01-01 us 12
2017-01-02 us 12
2017-02-02 us 134
2017-02-03 us 76
2017-03-30 us 54
2017-01-31 us 29
2017-01-01 us 58
2017-01-02 us 12
2017-02-02 ca 98
2017-02-03 ca 78
2017-03-30 ca 86
2017-01-31 ca 85
Then create a Month-Year column:
a['MonthYear']= a.Dated.dt.strftime('%b %Y')
Then, drop the Date column and aggregate by mean:
a.drop('Dated', axis=1).groupby(['MonthYear','country']).mean().rename(columns={'num':'Averaged'}).reset_index()
MonthYear country Averaged
Feb 2017 ca 88.0
Feb 2017 us 105.0
Jan 2017 ca 85.0
Jan 2017 us 24.6
Mar 2017 ca 86.0
Mar 2017 us 54.0
I retained the Dated column just in case.
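One caveat with strftime('%b %Y'), visible in both outputs above: the result is a string, so groups sort alphabetically (Feb before Jan). Grouping by dt.to_period('M') instead keeps chronological order; a sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Time': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-02-01', '2017-02-02']),
    'Country': ['us', 'us', 'us', 'us'],
    'Count': [10, 20, 30, 50],
})

# Period groups sort chronologically; format to 'Jan 2017' strings at the end
out = (df.groupby([df['Time'].dt.to_period('M'), 'Country'])['Count']
         .mean()
         .reset_index(name='Monthly Average'))
out['Time'] = out['Time'].dt.strftime('%b %Y')
print(out)
```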
I would like to group my data by country then by year and sum up the value columns using pandas. Currently I am reading in the csv file and using the following:
data_cleaned= df.groupby(['Country', 'year'], as_index=False).sum()
Here is a sample of my dataset:
Country year value
Angola 2009 0
Angola 2009 0
Angola 2010 0
Angola 2010 0
Angola 2010 0
Angola 2010 0
Angola 2011 0
Angola 2011 0
Angola 2011 0
Angola 2011 0
Angola 2012 118
Angola 2012 0
Angola 2012 0
Angola 2012 0
Angola 2013 0
Angola 2013 0
Angola 2013 0
Angola 2013 0
Angola 2014 0
Angola 2014 0
Angola 2014 0
Angola 2014 0
Angola 2015 0
Angola 2015 0
Angola 2015 0
Angola 2015 0
Angola 2016 0
Angola 2016 0
Angola 2016 0
Angola 2016 0
Angola 2017 0
Australia 2009 0
Australia 2009 14
Australia 2009 0
Australia 2009 12
Australia 2010 0
Australia 2010 0
Australia 2010 54
Australia 2010 6
Australia 2011 0
Australia 2011 4
Australia 2011 17
Australia 2011 13
Australia 2012 8
Australia 2012 2
Australia 2012 4
Australia 2012 105
Australia 2013 0
Australia 2013 5
Australia 2013 0
Australia 2013 0
Australia 2014 0
Australia 2014 0
Australia 2014 0
Australia 2014 0
Australia 2015 0
Australia 2015 0
Australia 2015 0
Australia 2015 0
Australia 2016 0
Australia 2016 0
Australia 2016 0
Australia 2016 0
Australia 2017 0
But I get the following results:
Partner Country year value
0 Angola 2009 0.00
1 Angola 2010 0.00
2 Angola 2011 0.00
3 Angola 2012 86,280.00
4 Angola 2013 0.00
5 Angola 2014 0.00
6 Angola 2015 0.00
7 Angola 2016 0.00
8 Angola 2017 0.00
9 Australia 2009 54,879.00
10 Australia 2010 67,899.00
11 Australia 2011 50,965.00
12 Australia 2012 332,128.00
13 Australia 2013 16,515.00
14 Australia 2014 0.00
15 Australia 2015 0.00
16 Australia 2016 0.00
17 Australia 2017 0.00
This is obviously wrong: Angola has only one non-zero value, in 2012. That is the correct year to have a value, but I'm expecting 118 instead of 86,280.00. Could someone point out what I am doing wrong, and how I can correctly sum the value column based on the Country and year columns?
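One quick check worth running: inflated sums like these usually mean the full CSV contains more rows per (Country, year) pair than the preview shows. Counting rows per group alongside the sum makes that visible; a sketch on a hypothetical stand-in for the real file:

```python
import pandas as pd

# Hypothetical stand-in for the real CSV
df = pd.DataFrame({
    'Country': ['Angola'] * 4,
    'year': [2012] * 4,
    'value': [118, 0, 118, 0],
})

# If a (Country, year) pair has more rows than expected,
# the sum will be inflated accordingly
summary = df.groupby(['Country', 'year'], as_index=False).agg(
    rows=('value', 'size'),
    total=('value', 'sum'),
)
print(summary)
```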