Extracting Groups - python

Using Python 3.2, I was hoping to solve the issue below. My data consists of hundreds of rows (each signifying a project) and 21 columns. The first column is a unique project ID and the other 20 columns hold the person, or group of people, that led the project. person_1 is always filled; if there is a name in person_3, that means 3 people are working together, and if there is a name in person_18, that means 18 people are working together.
I have an Excel spreadsheet that is set up the following way:
unique ID person_1 person _2 person_3 person_4 ... person_20
12 Tom Sally Mike
16 Joe Mike
5 Joe Sally
1 Sally Mike Tom
6 Sally Tom Mike
2 Jared Joe Mike John ... Carl
I want to do a few things:
1) Make a column that will give me a unique 'Group Name' which will be, using unique ID 1 as my example, Sally/Mike/Tom. So it will be the names separated by '/'.
2) How can I treat, from my example, Sally/Mike/Tom the same as Sally/Tom/Mike. Meaning, I would like another column that makes the group name in alphabetical order (no matter the actual permutation), still separated by '/'.
3) This question is similar to (2). However, I want the person listed in person_1 to matter. Meaning Joe/Tom/Mike is different from Tom/Joe/Mike but not different than Joe/Mike/Tom. So there will be another column that keeps person_1 at the start of the group name but alphabetizes person_2 through person_20 if applicable (i.e., if the project has more than 1 person on it).
Thanks for the help and suggestions

The previous answer gave a clear statement of method, but perhaps you are stuck on either the string processing or the csv processing. Both are demonstrated in the following code. The relevant tools are the built-in function sorted and the string method join. '/'.join tells join to use / as the separator between joined items. The + operator between lists in the tname and writerow statements concatenates the lists. A csv.reader is an iterator that delivers one list per row, and a csv.writer converts a list to a row and writes it out. You will want to add error testing to the file opens, etc. The data file used to test this code is shown after the code.
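import csv

# newline='' lets the csv module handle line endings itself (Python 3)
with open('xgroup.csv', newline='') as fi, open('xgroup3.csv', 'w', newline='') as fo:
    r = csv.reader(fi)
    w = csv.writer(fo)
    li = 0
    print("Opened reader and writer")
    for row in r:
        gname = '/'.join(row[1:])                     # names in their original order
        sname = '/'.join(sorted(row[1:]))             # all names alphabetized
        tname = '/'.join([row[1]] + sorted(row[2:]))  # person_1 first, the rest alphabetized
        w.writerow([row[0], gname, sname, tname] + row[1:])
        li += 1
# both files are closed automatically when the with block ends
print("Closed reader and writer after", li, "lines")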
File xgroup.csv is shown next.
unique-ID,person_1,person,_2,person_3,person_4,...,person_20
12,Tom,Sally,Mike
16,Joe,Mike
5,Joe,Sally
1,Sally,Mike,Tom
6,Sally,Tom,Mike
2,Jared,Joe,Mike,John,...,Carl
Upon reading data as above, the program prints Opened reader and writer and Closed reader and writer after 7 lines and produces output in file xgroup3.csv as shown next.
unique-ID,person_1/person/_2/person_3/person_4/.../person_20,.../_2/person/person_1/person_20/person_3/person_4,person_1/.../_2/person/person_20/person_3/person_4,person_1,person,_2,person_3,person_4,...,person_20
12,Tom/Sally/Mike,Mike/Sally/Tom,Tom/Mike/Sally,Tom,Sally,Mike
16,Joe/Mike,Joe/Mike,Joe/Mike,Joe,Mike
5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
1,Sally/Mike/Tom,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Mike,Tom
6,Sally/Tom/Mike,Mike/Sally/Tom,Sally/Mike/Tom,Sally,Tom,Mike
2,Jared/Joe/Mike/John/.../Carl,.../Carl/Jared/Joe/John/Mike,Jared/.../Carl/Joe/John/Mike,Jared,Joe,Mike,John,...,Carl
Note, given a data line like
5,Joe,Sally,,,,,
instead of
5,Joe,Sally
the program as above produces
5,Joe/Sally/////,/////Joe/Sally,Joe//////Sally,Joe,Sally,,,,,
instead of
5,Joe/Sally,Joe/Sally,Joe/Sally,Joe,Sally
If that's a problem, filter out empty entries. For example, if
row=['5', 'Joe', 'Sally', '', '', '', '', ''], then
'/'.join(row[1:]) produces
'Joe/Sally/////', while
'/'.join(filter(lambda x: x, row[1:])) and
'/'.join(x for x in row[1:] if x) and
'/'.join(filter(len, row[1:])) produce
'Joe/Sally' .
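Putting the filtering together with the three group names, a minimal sketch could look like the following. The helper name group_names is made up for illustration and is not part of the code above; it assumes each row is a list whose first item is the ID and whose empty cells are empty strings.

```python
def group_names(row):
    """Return (gname, sname, tname) for one CSV row, ignoring empty cells.

    group_names is a hypothetical helper, named here only for illustration.
    """
    people = [name for name in row[1:] if name]          # drop empty entries
    gname = '/'.join(people)                             # original order
    sname = '/'.join(sorted(people))                     # fully alphabetized
    tname = '/'.join(people[:1] + sorted(people[1:]))    # person_1 first, rest sorted
    return gname, sname, tname


print(group_names(['5', 'Joe', 'Sally', '', '', '', '', '']))
print(group_names(['12', 'Tom', 'Sally', 'Mike']))
```

Inside the loop above, the three names returned by this helper would replace the separate gname, sname and tname lines.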

You could do the following:
Export your file to a .csv file from Excel
Open that input file using python's csv module, using csv.reader
Open another file (output) to write to it using csv.writer
Iterate over each row in your reader, do your treatment, and write that using your writer
Import the output file in Excel

Related

How to split two first names that are joined together into two different words in Python

I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two first names that are together into two different words.
For example, if the misspelled name is trujillohernandez then to be separated to trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checker libraries do not work, given that these are first names and they are Hispanic names.
I would be really grateful if you can help to develop some sort of function to make it happen.
As noted in the comments above, not having a list of possible names will cause a problem. This is not perfect, but as a starting point, try the following.
Given a dataframe example like...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
Code (Python):
import re
from io import StringIO
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
# wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string
table = pd.read_html(StringIO(page.text.replace('<br />',' ')))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# build one pattern that matches any known surname at the end of the string;
# re.escape guards against regex metacharacters inside a surname
lastname = '({})$'.format('|'.join(re.escape(name) for name in last_names))
# create new columns for the matched names
df['Firstname'] = df.Name.str.replace(lastname, '', regex=True)
df['Lastname'] = df.Name.str.extract(lastname, expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
Outputs:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
Further cleanup may be required but perhaps it gets the majority of names split.
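For a single name, the same suffix-matching idea can be sketched with only the standard library. The function name split_joined_name and the three-entry surname list are assumptions made for illustration; a real run would use a full surname list such as the one downloaded above.

```python
import re


def split_joined_name(name, surnames):
    """Split name into (first, last) when it ends with a known surname.

    Returns (name, None) when no surname matches, or when the surname
    would be the whole string.
    """
    # One alternation anchored at the end of the string. re.search returns
    # the leftmost match, which here is the longest matching suffix.
    pattern = re.compile('(' + '|'.join(re.escape(s) for s in surnames) + ')$')
    match = pattern.search(name)
    if match is None or match.start() == 0:
        return name, None
    return name[:match.start()], match.group(1)


# Tiny illustrative list, not a real data set
surnames = ['gomez', 'hernandez', 'trujillo']
print(split_joined_name('trujillohernandez', surnames))
print(split_joined_name('sofíagomez', surnames))
print(split_joined_name('unknownname', surnames))
```

Names that match no surname come back unchanged with None as the second item, so they are easy to collect for manual review.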

Convert comment (list) to dataframe, pandas

I have a big list of names. I want to keep it in my interpreter, so I would rather not use CSV files.
The only way I can store it in my interpreter as a variable, using copy-paste from my original file, is as a comment (a triple-quoted string),
so my input looks like this:
temp='''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''
My goal is to convert this 'comment' inside my interpreter to a dataframe.
I tried
df=pd.DataFrame([temp]) and also converting it to a Series with only one column in the comment, but without success. Any ideas?
My real data has hundreds of lines.
Use:
from io import StringIO
import pandas as pd
temp=u'''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''
# StringIO makes the string look like a file, so read_csv can parse it
df = pd.read_csv(StringIO(temp))
print (df)
A B C
0 adam dorothy ben
1 luis cristy hoover

How do I combine multiple rows of a CSV that share data into one row using Pandas?

I have downloaded the ASCAP database, giving me a CSV that is too large for Excel to handle. I'm able to chunk the CSV to open parts of it, the problem is that the data isn't super helpful in its default format. Each song title has 3+ rows associated with it:
The first row includes the % share that ASCAP has in that song.
The rows after that include a character code (ROLE_TYPE) that indicates if that row contains the writer or performer of that song.
The first column of each row contains a song title.
This structure makes the data confusing because on the rows that list the % share there are blank cells in the NAME column because that row does not have a Writer/Performer associated with it.
What I would like to do is transform this data from having 3+ rows per song to having 1 row per song with all relevant data.
So instead of:
TITLE, ROLE_TYPE, NAME, SHARES, NOTE
I would like to change the data to:
TITLE, WRITER, PERFORMER, SHARES, NOTE
Here is a sample of the data:
TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
PEOPLE KNO,ASCAP,Total Current ASCAP Share,100,
PEOPLE KNO,W,SMITH ANTONIO RENARD,,
PEOPLE KNO,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,
I would like the data to look like:
TITLE, WRITER, PERFORMER, SHARES, NOTE
SCORE MORE, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
PEOPLE KNO, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
FEEDBACK, SMITH ANTONIO RENARD, SMITH SHOW PUBLISHING, 100,
I'm using python/pandas to try and work with the data. I am able to use groupby('TITLE') to group rows with matching titles.
import pandas as pd
data = pd.read_csv("COMMA_ASCAP_TEXT.txt", low_memory=False)
title_grouped = data.groupby('TITLE')
for TITLE,group in title_grouped:
    print(TITLE)
    print(group)
I was able to groupby('TITLE') of each song, and the output I get seems close to what I want:
SCORE MORE
TITLE ROLE_TYPE NAME SHARES NOTE
0 SCORE MORE ASCAP Total Current ASCAP Share 100.0 NaN
1 SCORE MORE W SMITH ANTONIO RENARD NaN NaN
2 SCORE MORE P SMITH SHOW PUBLISHING NaN NaN
What do I need to do to take this group and produce a single row in a CSV file with all the data related to each song?
I would recommend:
Decompose the data by the ROLE_TYPE
Prepare the data for merge (rename columns and drop unnecessary columns)
Merge everything back into one DataFrame
Merge will be automatically performed over the column which has the same name in the DataFrames being merged (TITLE in this case).
Seems to work nicely :)
import pandas as pd
data = pd.read_csv("data2.csv", sep=",")
# Create 3 individual DataFrames for different roles
data_ascap = data[data["ROLE_TYPE"] == "ASCAP"].copy()
data_writer = data[data["ROLE_TYPE"] == "W"].copy()
data_performer = data[data["ROLE_TYPE"] == "P"].copy()
# Remove unnecessary columns for ASCAP role
data_ascap.drop(["ROLE_TYPE", "NAME"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for WRITER role
data_writer.rename(index=str, columns={"NAME": "WRITER"}, inplace=True)
data_writer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Rename columns and remove unnecessary columns for PERFORMER role
data_performer.rename(index=str, columns={"NAME": "PERFORMER"}, inplace=True)
data_performer.drop(["ROLE_TYPE", "SHARES", "NOTE"], axis=1, inplace=True)
# Merge all together
result = data_ascap.merge(data_writer, how="left")
result = result.merge(data_performer, how="left")
# Print result
print(result)
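If the file is too large to load comfortably, the same decomposition by ROLE_TYPE can be sketched with only the standard library, reading one row at a time. This is an illustration under assumptions, not the method above: the function name combine_rows is made up, the sample rows are copied from the question, and a title with several writers or performers has their names joined with "; ".

```python
import csv
import io

# Sample rows copied from the question; a real run would pass an open file instead.
sample = """TITLE,ROLE_TYPE,NAME,SHARES,NOTE
SCORE MORE,ASCAP,Total Current ASCAP Share,100,
SCORE MORE,W,SMITH ANTONIO RENARD,,
SCORE MORE,P,SMITH SHOW PUBLISHING,,
FEEDBACK,ASCAP,Total Current ASCAP Share,100,
FEEDBACK,W,SMITH ANTONIO RENARD,,
"""

# Which output column each role code fills (naming taken from the question)
ROLE_COLUMN = {"W": "WRITER", "P": "PERFORMER"}


def combine_rows(lines):
    """Collapse the per-role rows of each title into one dict per title."""
    songs = {}  # TITLE -> combined record, in order of first appearance
    for row in csv.DictReader(lines):
        song = songs.setdefault(row["TITLE"], {
            "TITLE": row["TITLE"], "WRITER": "", "PERFORMER": "",
            "SHARES": "", "NOTE": ""})
        role = row["ROLE_TYPE"]
        if role == "ASCAP":
            # The share row carries SHARES and NOTE but no person
            song["SHARES"] = row["SHARES"]
            song["NOTE"] = row["NOTE"]
        elif role in ROLE_COLUMN:
            column = ROLE_COLUMN[role]
            # Join repeated roles instead of overwriting the earlier name
            song[column] = "; ".join(filter(None, [song[column], row["NAME"]]))
    return list(songs.values())


combined = combine_rows(io.StringIO(sample))
for song in combined:
    print(song)
```

The resulting dicts can be written back out with csv.DictWriter using the field names TITLE, WRITER, PERFORMER, SHARES, NOTE.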

Python Sorting and Organising

I'm trying to sort data from a file and not quite getting what I need. I have a text file with race details (name and placement, i.e. 1, 2, 3). I would like to be able to organize the data by highest placement first and also alphabetically by name. I can do this if I split the lines, but then the name and score will not match up.
Any help and suggestions would be very welcome; I've hit that proverbial wall.
My apologies (first-time user of this site, and Python noob, steep learning curve). Thank you for your suggestions, I really do appreciate the help.
comp=[]
results = open('d:\\test.txt', 'r')
for line in results:
    line=line.split()
    # (name,score)= line.split()
    comp.append(line)
sorted(comp)
results.close()
print (comp)
Test file was in this format:
Jones 2
Ranfel 7
Peterson 5
Smith 1
Simons 9
Roberts 4
McDonald 3
Rogers 6
Elliks 8
Helm 10
I completely agree with everyone who has down-voted this question for being badly posed. However, I'm in a good mood so I'll try and at least steer you in the right direction:
Let's assume your text file looks like this:
Name,Placement
D,1
D,2
C,1
C,3
B,1
B,3
A,1
A,4
I suggest importing the data and sorting it using Pandas http://pandas.pydata.org/
import pandas as pd
# Read in the data
# Replace <FULL_PATH_OF_FILE> with something like C:/Data/RaceDetails.csv
# The first row is automatically used for column names
data=pd.read_csv("<FULL_PATH_OF_FILE>")
# Sort the data (DataFrame.sort was removed from pandas; sort_values replaces it)
sorted_data=data.sort_values(['Placement','Name'])
# Create a re-indexed data frame if you so desire
sorted_data_new_index=sorted_data.reset_index(drop=True)
This gives me:
Name Placement
A 1
B 1
C 1
D 1
D 2
B 3
C 3
A 4
I'll leave you to figure out the rest..
As @Jack said, I am very limited in how I can help if you don't post code or the txt file. However, I've run into a similar problem before, so I know the basics (again, I will need code/files before I can give an exact type-this-stuff answer!)
You can either develop an algorithm yourself, or use the built-in sorted feature
Put the names and scores in a list (or dictionary) such as:
name_scores = [['Matt', 95], ['Bob', 50], ['Ashley', 100]]
and then call sorted(name_scores) and it will sort by names: [['Ashley', 100], ['Bob', 50], ['Matt', 95]]
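Applied to the test file from the question, a minimal sketch could look like this. It assumes placement 1 is the best result, and it reads the sample lines from a string rather than from d:\test.txt so that it runs anywhere.

```python
# Sample lines copied from the question's test file
lines = """Jones 2
Ranfel 7
Peterson 5
Smith 1
Simons 9
Roberts 4
McDonald 3
Rogers 6
Elliks 8
Helm 10""".splitlines()

results = []
for line in lines:
    name, placement = line.split()
    # Convert to int so that 10 sorts after 9 rather than right after 1
    results.append((name, int(placement)))

# Keeping name and placement in one tuple means they can never get separated
by_placement = sorted(results, key=lambda pair: (pair[1], pair[0]))
by_name = sorted(results)  # tuples compare by name first

for name, placement in by_placement:
    print(placement, name)
```

The key function returns (placement, name), so ties in placement would fall back to alphabetical order by name.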

python csv: getting subset

Here is a snapshot of my csv:
alex 123f 1
harry fwef 2
alex sef 3
alex gsdf 4
alex wf35 6
harry sdfsdf 3
I would like to get the subset of this data where the value in the first column (harry, alex) occurs at least 4 times, so I want the resulting data set to be:
alex 123f 1
alex sef 3
alex gsdf 4
alex wf35 6
Clearly, you cannot decide which rows are interesting until you've seen all rows (since the very last row might be the one turning some count from three to four and thereby making some previously seen rows interesting, for example;-). So, unless your CSV file is horribly huge, suck it all into memory, first, as a list...:
import csv
with open('thefile.csv', 'rb') as f:
    data = list(csv.reader(f))
then, do the counting -- Python 2.7 has a better way, but assuming you're still on 2.6 like most of us...:
import collections
counter = collections.defaultdict(int)
for row in data:
    counter[row[0]] += 1
and finally do the selection loop...:
for row in data:
    if counter[row[0]] >= 4:
        print row
Of course, this prints each interesting row as a roughly-hewed list (with square brackets and quotes around the items), but it will be easy to format it in any way you might prefer.
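On Python 3, the same read, count and select steps can be sketched as below. The "better way" in 2.7 is presumably collections.Counter, which is what this uses; the function name rows_with_frequent_key is made up for illustration, and the sample assumes comma-separated fields even though the snapshot in the question shows spaces.

```python
import collections
import csv
import io

# Sample data from the question, written with commas (an assumption)
sample = """alex,123f,1
harry,fwef,2
alex,sef,3
alex,gsdf,4
alex,wf35,6
harry,sdfsdf,3
"""


def rows_with_frequent_key(lines, min_count=4):
    """Return the rows whose first column occurs at least min_count times."""
    data = list(csv.reader(lines))                     # read everything first
    counter = collections.Counter(row[0] for row in data)
    return [row for row in data if counter[row[0]] >= min_count]


subset = rows_with_frequent_key(io.StringIO(sample))
for row in subset:
    print(','.join(row))
```

With a real file, the io.StringIO object would be replaced by a file opened in text mode with newline=''.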
If Python is not a must:
$ gawk '{b[$1]++;c[++d,$1]=$0}END{for(i in b){if(b[i]>=4){for(j=1;j<=d;j++){if((j,i) in c)print c[j,i]}}}}' file
And yes, 70MB file is fine.
