I am working with the sklearn.datasets.fetch_20newsgroups() dataset. Here, there are some documents that belong to more than one news group. I want to treat those documents as two different entities that each belong to one news group. To do this, I've brought the document IDs and group names into a dataframe.
import sklearn
from sklearn import datasets
data = datasets.fetch_20newsgroups()
filepaths = data.filenames.astype(str)
keys = []
for path in filepaths:
keys.append(os.path.split(path)[1])
groups = pd.DataFrame(keys, columns = ['Document_ID'])
groups['Group'] = data.target
groups.head()
>> Document_ID Group
0 102994 7
1 51861 4
2 51879 4
3 38242 1
4 60880 14
print (len(groups))
>>11314
print (len(groups['Document_ID'].drop_duplicates()))
>>9840
print (len(groups['Group'].drop_duplicates()))
>>20
For each Document_ID, I want to change its value if it has more than one Group number assigned. Example,
groups[groups['Document_ID']=='76139']
>> Document_ID Group
5392 76139 6
5680 76139 17
I want this to become:
>> Document_ID Group
5392 76139 6
5680 12345 17
Here, 12345 is a random new ID that is not already in keys list.
How can I do this?
You can find all the rows that contain duplicate Document_ID after the first with the duplicated methdod. Then create a list of new id's beginning with one more than the max id. Use the loc indexing operator to overwrite the duplicate keys with the new ids.
groups['Document_ID'] = groups['Document_ID'].astype(int)
dupes = groups.Document_ID.duplicated(keep='first')
max_id = groups.Document_ID.max() + 1
new_id = range(max_id, max_id + dupes.sum())
groups.loc[dupes, 'Document_ID'] = new_id
Test case
groups.loc[[5392,5680]]
Document_ID Group
5392 76139 6
5680 179489 17
Ensure that no duplicates remain.
groups.Document_ID.duplicated(keep='first').any()
False
Kinda Hacky, but why not!
data = {"Document_ID": [102994,51861,51879,38242,60880,76139,76139],
"Group": [7,1,3,4,4,6,17],
}
groups = pd.DataFrame(data)
groupDict ={}
tempLst=[]
#Create a list of unique ID's
DocList = groups['Document_ID'].unique()
DocList.tolist()
#Build a dictionary and push all group ids to the correct doc id
DocDict = {}
for x in DocList:
DocDict[x] = []
for index, row in groups.iterrows():
DocDict[row['Document_ID']].append(row['Group'])
#For all doc Id's with multip entries create a new id with the group id as a decimal point.
groups['DupID'] = groups['Document_ID'].apply(lambda x: len(DocDict[x]))
groups["Document_ID"] = np.where(groups['DupID'] > 1, groups["Document_ID"] + groups["Group"]/10,groups["Document_ID"])
Hope that helps...
Related
I am trying to categorize a dataset based on the string that contains the name of the different objects of the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category'], the Category and Sub_Category columns are empty.
For each row I would like to check in different lists of words if the name of the object contains at least one word in one of the list. Based on this first check I would like to attribute a value to the category column. If it finds more than 1 word in 2 different lists I would like to attribute 2 values to the object in the category column.
Moreover, I would like to be able to identify which word has been checked in which list in order to attribute a value to the sub_category column.
Until now, I have been able to do it with only one list, but I am not able to identity which word has been checked and the code is very long to run.
Here is my code (where I added an example of names found in my dataset as df['Name']) :
import pandas as pd
import numpy as np
df['Name'] = ['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
for idx, row in df.iterrows():
for c in furniture_check:
if c in row['Name']:
df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name":['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
# put categories into a dataframe
dfcat = pd.DataFrame([{"category":"furniture","values":furniture_check},
{"category":"vechile","values":vehicle_check},
{"category":"art","values":art_check}])
# turn apace delimited "name" column into a list
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
# explode list so it can be used as join. reset_index() to keep a copy of index of original DF
.explode("name").reset_index()
# merge exploded names on both side
.merge(dfcat.explode("values"), left_on="name", right_on="values")
# where there are multiple categoryies, make it a list
.groupby("index", as_index=False).agg({"category":lambda s: list(s)})
# but original index back...
.set_index("index")
)
# simple join and have names and list of associated categories
df.join(dfcatlist)
name
category
0
vitrine murale vintage
nan
1
commode ancienne
['furniture']
2
lustre antique
nan
3
solex
['vechile']
4
sculpture médievale
nan
5
jante voiture
['vechile']
6
lit et matelas
['furniture']
7
turbine moteur
nan
I have two dictionaries:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
and a table consisting of one single column where bond names are contained:
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
I need to replace the name with a string of the following format: EUA21 where the first two letters are the corresponding value to the currency key in the dictionary, the next letter is the value corresponding to the month key and the last two digits are the year from the name.
I tried to split the name using the following code:
bond_names['Names']=bond_names['Names'].apply(lambda x: x.split('.'))
but I am not sure how to proceed from here to create the string as I need to search the dictionaries at the same time for the currency and month extract the values join them and add the year from the name onto it.
This will give you a list of what you need:
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = {'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']}
result = []
for names in bond_names['Names']:
bond = names.split('.')
result.append(currency[bond[1]] + time[bond[2]] + bond[3])
print(result)
You can do that like this:
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency = {'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names = pd.DataFrame({'Names': ['Bond.USD.JAN.21', 'Bond.USD.MAR.25', 'Bond.EUR.APR.22', 'Bond.HUF.JUN.21', 'Bond.HUF.JUL.23', 'Bond.GBP.JAN.21']})
bond_names['Names2'] = bond_names['Names'].apply(lambda x: currency[x[5:8]] + time[x[9:12]] + x[-2:])
print(bond_names['Names2'])
# 0 USA21
# 1 USC25
# 2 EUD22
# 3 HFF21
# 4 HFH23
# 5 GBA21
# Name: Names2, dtype: object
With extended regex substitution:
In [42]: bond_names['Names'].str.replace(r'^[^.]+\.([^.]+)\.([^.]+)\.(\d+)', lambda m: '{}{}{}'.format(curre
...: ncy.get(m.group(1), m.group(1)), time.get(m.group(2), m.group(2)), m.group(3)))
Out[42]:
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21
Name: Names, dtype: object
You can try this :
import pandas as pd
time = {'JAN':'A','FEB':'B','MAR':'C','APR':'D','MAY':'E','JUN':'F','JUL':'H'}
currency={'USD':'US','EUR':'EU','GBP':'GB','HUF':'HF'}
bond_names=pd.DataFrame({'Names':['Bond.USD.JAN.21','Bond.USD.MAR.25','Bond.EUR.APR.22','Bond.HUF.JUN.21','Bond.HUF.JUL.23','Bond.GBP.JAN.21']})
bond_names['Names']=bond_names['Names'].apply(lambda x: x.split('.'))
for idx, bond in enumerate(bond_names['Names']):
currencyID = currency.get(bond[1])
monthID = time.get(bond[2])
yearID = bond[3]
bond_names['Names'][idx] = currencyID + monthID + yearID
Output
Names
0 USA21
1 USC25
2 EUD22
3 HFF21
4 HFH23
5 GBA21
Suppose I have a two-column dataframe where the first column is the ID of a meeting and the second is the ID of one of the participants in that meeting. Like this:
meeting_id,person_id
meeting0,person1234
meeting0,person4321
meeting0,person5555
meeting1,person4321
meeting1,person9999
# ... ~1 million rows
I want to find each person's top 15 co-participants. Eg.: I want to know which 15 people most frequently participate in meetings with Brad.
As an intermediate step I wrote a script that takes the original dataframe and makes a person-to-person dataframe, like this:
person1234,person4321
person1234,person5555
person4321,person5555
person4321,person9999
...
But I'm not sure this intermediate step is necessary. Also, it's taking forever to run (by my estimate it should take weeks!). Here's the monstrosity:
import pandas as pd
links = []
lic = pd.read_csv('meetings.csv', sep = ';', names = ['meeting_id', 'person_id'], dtype = {'meeting_id': str, 'person_id': str})
grouped = lic.groupby('person_id')
for i, group in enumerate(grouped):
print(i, 'of', len(grouped))
person_id = group[0].strip()
if len(person_id) == 14:
meetings = set(group[1]['meeting_id'])
for meeting in meetings:
lic_sub = lic[lic['meeting_id'] == meeting]
people = set(lic_sub['person_id'])
for person in people:
if person != person_id:
tup = (person_id, person)
links.append(tup)
df = pd.DataFrame(links)
df.to_csv('links.csv', index = False)
Any ideas?
So here is one way using merge then sort the columns
s=df.merge(df,on='meeting_id')
s[['person_id_x','person_id_y']]=np.sort(s[['person_id_x','person_id_y']].values,1)
s=s.query('person_id_x!=person_id_y').drop_duplicates()
s
meeting_id person_id_x person_id_y
1 meeting0 person1234 person4321
2 meeting0 person1234 person5555
5 meeting0 person4321 person5555
10 meeting1 person4321 person9999
I have a dataframe and the first column contains id. How do I sort the first column when it contains alphanumeric data, such as:
id = ["6LDFTLL9", "N9RFERBG", "6RHSDD46", "6UVSCF4H", "7SKDEZWE", "5566FT6N","6VPZ4T5P", "EHYXE34N", "6P4EF7BB", "TT56GTN2", "6YYPH399" ]
Expected result is
id = ["5566FT6N", "6LDFTLL9", "6P4EF7BB", "6RHSDD46", "6UVSCF4H", "6VPZ4T5P", "6YYPH399", "7SKDEZWE", "EHYXE34N", "N9RFERBG", "TT56GTN2" ]
You can utilize the .sort() method:
>>> id.sort()
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
This will sort the list in place. If you don't want to change the original id list, you can utilize the sorted() method
>>> sorted(id)
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
>>> id
['6LDFTLL9', 'N9RFERBG', '6RHSDD46', '6UVSCF4H', '7SKDEZWE', '5566FT6N', '6VPZ4T5P', 'EHYXE34N', '6P4EF7BB', 'TT56GTN2', '6YYPH399']
Notice, with this one, that id is unchanged.
For a DataFrame, you want to use sort_values().
df.sort_values(0, inplace=True)
0 is either the numerical index of your column or you can pass the column name (eg. id)
0
5 5566FT6N
0 6LDFTLL9
8 6P4EF7BB
2 6RHSDD46
3 6UVSCF4H
6 6VPZ4T5P
10 6YYPH399
4 7SKDEZWE
7 EHYXE34N
1 N9RFERBG
9 TT56GTN2
I've been trying to create a dictionary of data frames so I can store data coming from different files. I have one dataframe created in the following loop, and I would like to aggregate them to have each dataframe to the dictionary. I will have to join them later by the date.
d = {}
for num in range(3,14):
nodeName = "rgs" + str(num).zfill(2) #The key should be the nodeName
# Bunch of stuff to get the data ...
# Fill dataframe
data = {'date':date_list, 'users':users_list}
df = pd.DataFrame(data)
df = df.convert_objects(convert_numeric=True)
df = df.dropna(subset=['users'])
df['users'] = df['users'].astype(int)
d = {nodeName:df}
print d
The problem that I have is, if I print the dictionary out of the loop I only have one item, the last one.
{'rgs13': date users
0 2016-01-18 1
1 2016-01-19 1
2 2016-01-20 1
3 2016-01-21 1
4 2016-01-22 1
5 2016-01-23 1
6 2016-01-24 0
But I can clearly see that I can generate all the dataframes without problems inside the loop. How can I make the dictionary to keep all the df's? What am I doing wrong?
Thanks for the help.
It's because in the end you are re-defining d.
What you want is this:
d = {}
for num in range(3,14):
nodeName = "rgs" + str(num).zfill(2) #The key should be the nodeName
# Bunch of stuff to get the data ...
# Fill dataframe
data = {'date':date_list, 'users':users_list}
df = pd.DataFrame(data)
df = df.convert_objects(convert_numeric=True)
df = df.dropna(subset=['users'])
df['users'] = df['users'].astype(int)
d[nodeName] = df
print d
Instead of d = {nodeName:df} use
d[nodeName] = df
Since this adds a key/value pair to d whereas d = {nodeName:df} reassigns d to a new dict (with only the one key/value pair). Doing that in a loop spells death to all the previous key/value pairs.
You may find Ned Batchelder's Facts and myths about Python names and values a useful read. It will give you the right mental model for thinking about the relationship between variable names and values, and help you see what statements modify values (e.g. d[nodeName] = df) versus reassign variable names (e.g. d = {nodeName:df}).