I am somewhat of a beginner with Python and have encountered the following problem working with openpyxl. For example, I have the sample worksheet below:
Worksheet
| Boat ID | Emp ID | Emp Name | Start Date | Manager |
|---------|--------|----------|------------|---------|
| 1       | 16044  | Derrick  | ASAP       | Anthony |
| 1       | 16045  | John     | ASAP       | Anthony |
| 1       | 16046  | Bill     | ASAP       | Anthony |
| 1       | 16047  | Joe      | ASAP       | Anthony |
| 2       | 16048  | Justin   | ASAP       | Jacob   |
| 2       | 16049  | Sandy    | ASAP       | Jacob   |
| 2       | 16050  | Omar     | ASAP       | Jacob   |
| 3       | 16051  | Michael  | ASAP       | Nathan  |
| 3       | 16052  | Bill     | ASAP       | Nathan  |
What I am trying to do is loop through the Boat ID column and, while the cell values are equal, take the respective row data to the right (columns B:E), open a new worksheet/workbook, and copy those rows into it.
So in theory, for every Boat ID = 1 we would take every row unique to ID 1 from columns B:E, open a new workbook, and paste them accordingly. Next, for every Boat ID = 2 we would take the rows with ID = 2 in columns B:E, open a new workbook, and paste accordingly. We would then repeat the process for Boat ID = 3.
P.S. To keep it simple I have ordered the table by Boat ID in ascending order, but if someone wants bonus points they could opine on how it would be done if the table was not ordered.
Any help here would be appreciated as I am still learning and a complex problem like this would be beneficial to further enhance my skills.
I know I am way off with the logic I have so far.
I used a dictionary to categorize all of the boats read from the original file according to their Boat ID: the keys are the Boat IDs and the values are the lists of boat rows. As you can see, the code will work even if the original xlsx file is not sorted by Boat ID.
import openpyxl

wb = openpyxl.load_workbook("boats.xlsx", read_only=True)
ws = wb.active

# Group the rows by Boat ID; min_row=2 skips the header row
boat_dict = {}
for row in ws.iter_rows(min_row=2, values_only=True):
    boat_id = row[0]
    if boat_id in boat_dict:
        boat_dict[boat_id].append(row[1:])   # keep only columns B:E
    else:
        boat_dict[boat_id] = [row[1:]]

# Write each group to its own sheet of a new workbook
new_wb = openpyxl.Workbook()
new_wb.remove(new_wb.active)   # drop the default empty sheet
for boat_id, boats in boat_dict.items():
    out_ws = new_wb.create_sheet(title="Boat id %s" % boat_id)
    for boat in boats:
        out_ws.append(boat)
new_wb.save("boats_ans.xlsx")
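If you want one file per Boat ID instead of one sheet per Boat ID, as the question literally asks, a small variation of the last loop works (the boat_%s.xlsx file name here is just an assumption for illustration):

for boat_id, boats in boat_dict.items():
    wb_out = openpyxl.Workbook()
    ws_out = wb_out.active
    for boat in boats:
        ws_out.append(boat)
    wb_out.save("boat_%s.xlsx" % boat_id)   # one workbook per Boat ID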
Hope this helps :)
I am working with DataFrames in Python (pandas).
In my first DataFrame df1 I have:
+-----+-------------------+--------------+------------------+
| ID  | PUBLICATION TITLE | DATE         | JOURNAL          |
+-----+-------------------+--------------+------------------+
| 1   | "a"               | "01/10/2000" | "book1"          |
| 2   | "b"               | "09/03/2005" | NaN              |
| NaN | "b"               | "09/03/2005" | "book2"          |
| 5   | "z"               | "21/08/1995" | "book4"          |
| 6   | "n"               | "15/04/1993" | "book9\xc3\x28"  |
+-----+-------------------+--------------+------------------+
Here I would like to clean my DataFrame, but I don't know how to do it in this case.
Indeed, there are two points that block me.
The first one is that rows 2 and 3 seem to be the same record, because the publication title is the same, and I think a publication title is unique to a journal.
The second point is the last row, whose journal name ends with \xc3\x28.
How can I clean my DataFrame in a smart way, so that I can reuse this code for other DataFrames if possible?
First you should remove the row with ID = NaN. This can be done by:
df1 = df1[df1['ID'].notna()]
Then update the journal of the 2nd remaining row:
df1.iloc[1, df1.columns.get_loc('JOURNAL')] = 'book2'
Finally, for the 'book9\xc3\x28' entry (now at position 3, since one row was dropped), you can update it with:
df1.iloc[3, df1.columns.get_loc('JOURNAL')] = 'book9'
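Since you want something reusable, here is a more generic sketch of the same two fixes, under the assumption that duplicate records always share the same PUBLICATION TITLE: fill the missing values inside each title group from the other rows of that group, then drop the rows that have become identical.

# Fill NaNs within each publication-title group, then deduplicate
cols = ['ID', 'DATE', 'JOURNAL']
df1[cols] = df1.groupby('PUBLICATION TITLE')[cols].transform(lambda s: s.ffill().bfill())
df1 = df1.drop_duplicates()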
What type of encoding are you using? I recommend using "utf-8" encoding for this purpose.
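For the mojibake itself, one option (a sketch, not the only way) is to cut each journal name at the first non-ASCII character, which turns 'book9\xc3\x28' into 'book9':

# Remove everything from the first non-ASCII character onward;
# NaN entries pass through unchanged
df1['JOURNAL'] = df1['JOURNAL'].str.replace(r'[^\x00-\x7f].*', '', regex=True)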
I have two different datasets:
the users' "taste" table:
+---------+--------+-----------+-----------+----------+-------+--
| user_id | Action | Adventure | Animation | Children | Drama |
+---------+--------+-----------+-----------+----------+-------+--
|   100   |   0    |     1     |     2     |    1     |   0   |
|   101   |   1    |     4     |     0     |    3     |   0   |
+---------+--------+-----------+-----------+----------+-------+--
the movies' genre table:
+----------+--------+-----------+-----------+----------+-------+--
| movie_id | Action | Adventure | Animation | Children | Drama |
+----------+--------+-----------+-----------+----------+-------+--
|   1001   |   0    |     1     |     1     |    1     |   0   |
|   1001   |   0    |     1     |     0     |    1     |   0   |
+----------+--------+-----------+-----------+----------+-------+--
I am trying to recommend to a user the N most similar movies based on his taste. What I thought of is to measure the similarity (cosine similarity / dot product) between the user vector and each movie vector and return the top N most similar ones. What is the right way to implement this in Python?
It is an easy question, but depending on the type of distance and the size of the data the answer can be complex. I'll give you some hooks to start with.
sklearn has distance metrics implemented, which you can use right away to calculate the distance between items and, for instance with the help of argmax, find the best match. This is the naive approach, but it works fine on small data sets, and you have the flexibility to use any metric you want.
sklearn also has pairwise versions of these distances (sklearn.metrics.pairwise), made to easily calculate distance matrices and quickly find the best match. But you can imagine that for larger data sets this strategy won't work anymore.
When the data grows, you can use the BallTree algorithm to quickly find (1) the k nearest movies, or (2) all movies within a certain threshold. This algorithm is well implemented in sklearn, and I would advise starting with it, as it is a good balance between fast and easy to implement.
Another option is to use a specialised approximate-nearest-neighbour package such as faiss or annoy. I would only use those if the above fails in terms of speed or data size.
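A minimal sketch of the naive approach with cosine similarity (the two frames below are made-up stand-ins for the tables in the question; the second movie_id 1002 is an assumption, since the question repeats 1001):

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

genres = ['Action', 'Adventure', 'Animation', 'Children', 'Drama']
users = pd.DataFrame([[100, 0, 1, 2, 1, 0],
                      [101, 1, 4, 0, 3, 0]],
                     columns=['user_id'] + genres)
movies = pd.DataFrame([[1001, 0, 1, 1, 1, 0],
                       [1002, 0, 1, 0, 1, 0]],
                      columns=['movie_id'] + genres)

# Similarity matrix: one row per user, one column per movie
sim = cosine_similarity(users[genres], movies[genres])

N = 2
for i, user_id in enumerate(users['user_id']):
    top = np.argsort(sim[i])[::-1][:N]   # indices of the N most similar movies
    print(user_id, movies['movie_id'].iloc[top].tolist())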
x = (df.groupby(['id_gamer'])[['sucess', 'nb_games']]
       .shift(periods=1)
       .cumsum()
       .apply(lambda row: row.sucess / row.nb_games, axis=1))
In the code above, I do a groupby on a pandas.DataFrame to obtain, for each gamer and each game, a shifted column of results expressed as a ratio: his rate of success given the number of games he has played so far.
It returns a pandas.core.series.Series object as:
+---------------+----------------+
| Index | Computed_ratio |
+---------------+----------------+
| id_game_date | NaN |
| id_game2_date | 0.30 |
| id_game3_date | 0.40 |
| id_game_date | NaN |
| id_game4_date | 0.50 |
| ... | ... |
+---------------+----------------+
So you may see the NaN values as the delimitation between gamers. As you can see, the first gamer and the second one met in one game: id_game_date. This is why I would like the id_gamer column to appear in the result, so that I can merge it back with the DataFrame the data comes from.
To be honest, I have an idea for a solution: simply do not use the game ids as the index; then each row is indexed unambiguously and there is no conflict when I do the merge, I guess. But I would like to know whether it is possible with the current pattern shown here.
NB: I already tried the solutions presented in this topic, but none of them work, certainly because the functions shown there are aggregations, unlike mine: cumsum(). If I use an aggregating function like sum() (with a different code pattern; do not try it with the one I gave you or it will return an error), id_gamer does appear, but that does not match my expectations.
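To illustrate the idea above, here is a sketch, assuming df keeps a plain integer index with id_gamer and the game id as ordinary columns, and that the running totals should restart for each gamer (column names as in the snippet):

# Per-gamer running totals of the previous games (shifted by one),
# with the original row index preserved (group_keys=False)
shifted = (df.groupby('id_gamer', group_keys=False)[['sucess', 'nb_games']]
             .apply(lambda g: g.shift(1).cumsum()))

# Keep id_gamer next to the computed ratio so a later merge is unambiguous
out = df[['id_gamer']].assign(computed_ratio=shifted['sucess'] / shifted['nb_games'])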
I have a large dataframe containing a 'Description' column.
I've compiled a sizeable dictionary of lists, where the key is basically the Category, and the items are lists of possible (sub)strings contained in the description column.
I want to use the dictionary to classify each entry in the DataFrame based on this description. Unfortunately, I can't figure out how to apply a dictionary of lists as a mapping over a DataFrame (it feels like it would be some concoction of map, isin and str.contains, but I have had no joy). I've included code to generate a model dataset below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 1), columns=list('A'))
df['Description'] = ['White Ford Escort', 'Irish Draft Horse',
                     'Springer spaniel (dog)', 'Green Vauxhall Corsa',
                     'White Van', 'Labrador dog', 'Black horse',
                     'Blue Van', 'Red Vauxhall Corsa', 'Bear']
This model dataset would then ideally be somehow mapped against the following dictionary:
d = {'Car': ['Ford Escort', 'Vauxhall Corsa', 'Van'],
     'Animal': ['Dog', 'Horse']}
to generate a new column in the dataframe, with the result as such:
| | A | Description | Type |
|---|----------------------|------------------------|--------|
| 0 | -1.4120290137842615 | White Ford Escort | Car |
| 1 | -0.3141036399049358 | Irish Draft Horse | Animal |
| 2 | 0.49374344901643896 | Springer spaniel (dog) | Animal |
| 3 | 0.013654965767323723 | Green Vauxhall Corsa | Car |
| 4 | -0.18271952280002862 | White Van | Car |
| 5 | 0.9519081000007026 | Labrador dog | Animal |
| 6 | 0.403258571154998 | Black horse | Animal |
| 7 | -0.8647792960494813 | Blue Van | Car |
| 8 | -0.12429427259820519 | Red Vauxhall Corsa | Car |
| 9 | 0.7695980616520571 | Bear | - |
The numbers are obviously irrelevant here, but there are other columns in the DataFrame and I wanted the example to reflect that.
I'm happy to use regex, or perhaps to change my dictionary into a DataFrame and do a join (I've considered multiple routes).
This feels similar to a recent question, but it's not the same, and the answer there hasn't helped me.
Sorry if I've been stupid somewhere and this is really simple. It does feel like it should be, but I'm missing something.
Thanks
You can use the fuzzywuzzy library to solve this. Make sure to install it via pip install fuzzywuzzy.
import numpy as np
import pandas as pd
from fuzzywuzzy import process

df = pd.DataFrame(np.random.randn(10, 1), columns=list('A'))
df['Description'] = ['White Ford Escort', 'Irish Draft Horse',
                     'Springer spaniel (dog)', 'Green Vauxhall Corsa',
                     'White Van', 'Labrador dog', 'Black horse',
                     'Blue Van', 'Red Vauxhall Corsa', 'Bear']
d = {'Car': ['Ford Escort', 'Vauxhall Corsa', 'Van'],
     'Animal': ['Dog', 'Horse']}

# Construct a long-format dataframe from the dictionary: one row per
# (Type, substring) pair, in columns 'variable' and 'value'
df1 = pd.DataFrame([*d.values()], index=d.keys()).T.melt().dropna()

# Best fuzzy match for each description
m = df.Description.apply(lambda x: process.extract(x, df1.value)[0])

# Keep only matches with a score above 80 and concat them with the original df
df2 = pd.concat([df, m[m.apply(lambda x: x[1] > 80)].apply(lambda x: x[0])], axis=1)
df2.columns = [*df.columns, 'matches']

# Merge with df1 to pull in the Type, then drop the helper columns and rename
df2 = df2.merge(df1, left_on='matches', right_on='value', how='left')
df2 = df2.drop(columns=['matches', 'value']).rename(columns={'variable': 'Type'})
print(df2)
A Description Type
0 -0.423555 White Ford Escort Car
1 0.294092 Irish Draft Horse Animal
2 1.949626 Springer spaniel (dog) Animal
3 -1.315937 Green Vauxhall Corsa Car
4 -0.250184 White Van Car
5 0.186645 Labrador dog Animal
6 -0.052433 Black horse Animal
7 -0.003261 Blue Van Car
8 0.418292 Red Vauxhall Corsa Car
9 0.241607 Bear NaN
1. Consider inverting your dictionary first, while making everything lowercase.
2. Then, per row, split Description into words and make them lowercase, e.g. 'Springer spaniel (dog)' -> ['springer', 'spaniel', '(', 'dog', ')'].
3. For each lowercase word from (2), look it up in the inverted dictionary from (1), using apply; see the sketch below.
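A minimal sketch of these steps (d and df are the names from the question; one assumption beyond the recipe above is that multi-word phrases like 'Ford Escort' are split into single words so each word can be looked up individually):

import re

# Step 1: inverted, lowercased dictionary, one entry per single word
inv = {word.lower(): category
       for category, phrases in d.items()
       for phrase in phrases
       for word in phrase.split()}

def classify(description):
    # Step 2: lowercase words of the description
    words = re.findall(r'\w+', description.lower())
    # Step 3: the first word found in the inverted dictionary wins
    for w in words:
        if w in inv:
            return inv[w]
    return '-'   # no match, as in the 'Bear' row of the expected output

df['Type'] = df['Description'].apply(classify)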