Let's say I have a list of names like this one in a csv:
Nom;Link;NonLink
Deb;John;
John;Deb;
Martha;Travis;
Travis;Martha;
Allan;;
Lois;;
Jayne;;
Brad;;Abby
Abby;;Brad
I imported it using numpy:
import numpy as np
file = np.genfromtxt('liste.csv', dtype=str, delimiter=';', skip_header=1)  # dtype=str keeps a 2-D array of strings so the column slicing below works
Now, I'm isolating my first column:
Nom = np.array(file[:,0])
I would like to create a matrix using only this first column to get a result like this one:
Deb John Martha etc...
Deb 0 0 0 ...
John 0 0 0 ...
Martha 0 0 0 ...
etc...
Is there a numpy function for that?
Edit: My end goal is to make a little program to assign seats at tables where people in Link must be seated at the same table and NonLink must not be at the same table.
Thank you,
You can use pandas and create a DataFrame from the Nom variable.
Something like this:
import pandas as pd
df = pd.DataFrame([[0] * len(Nom)] * len(Nom), index=Nom, columns=Nom)
print(df)
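As a side note (a sketch, not from the original answer), the DataFrame constructor also accepts a scalar, which avoids building the nested list by hand:
# equivalent, letting pandas broadcast the scalar 0 across the whole frame
df = pd.DataFrame(0, index=Nom, columns=Nom)
print(df)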
Details about the goal: I'm learning basic ML and I'm tasked with finding the best match between some raw city names and some normalized city names.
Expected result: The idea is to find items that are similar according to the Levenshtein distance, and output the best match in a new right-hand column of the raw data df.
What I did: Originally, I wrote a nested loop that compares each raw row with the 36k normalized rows, takes the smallest distance and its index, and stores that in the rightmost column. I quickly concluded that this is not best practice because you're not supposed to loop over a pandas df, and 10,000 × 36k comparisons was just way too much. After some searching, I found the following code, which is supposed to work properly:
rawdata["Best match"]=rawdata["city"].map(lambda x: process.extractOne(x, normadata["city"])[0])
Sadly, this has been running for an hour on my computer, so I don't think it does the job either. What would you recommend to make this quicker?
Thank you for any time you would spend on this.
#import libraries
import pandas as pd
import sys
!pip install fuzzywuzzy
!pip install python-Levenshtein
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
def import_data(file):
return pd.read_csv(file, header=0, dtype=str)
rawdata = import_data("raw_cities.csv")
rawdata['city']=rawdata['city'].map(str)
normadata = import_data("normalized_cities.csv")
normadata['city']=normadata['city'].map(str)
rawdata["Best match"]=rawdata["city"].map(lambda x: process.extractOne(x, normadata["city"])[0])
#my tables look as such
city
0 cleron
1 aveillans
2 paray-vieille-poste
3 issac
4 rians
9995 neuville les dieppe
9996 saint andre de vezines
9997 saint-germain-de-lusignan
9998 bergues-sur-sambre
9999 santa-maria-figaniella
[10000 rows x 1 columns]
city
0 abergement clemenciat
1 abergement de varey
2 amberieu en bugey
3 amberieux en dombes
4 ambleon
35352 m'tsangamouji
35353 ouangani
35354 pamandzi
35355 sada
35356 tsingoni
[35357 rows x 1 columns]
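One possible way to speed this up (a sketch, not part of the original post, assuming the rapidfuzz package is acceptable; it mirrors fuzzywuzzy's process/fuzz API but is implemented in C++ and is typically much faster):
from rapidfuzz import process, fuzz

# a plain Python list of choices avoids the per-call pandas Series overhead
choices = normadata["city"].tolist()
rawdata["Best match"] = rawdata["city"].map(
    lambda x: process.extractOne(x, choices, scorer=fuzz.WRatio)[0]
)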
First of all, I have no background in computer languages and I am learning Python.
I'm trying to group some data in a data frame.
[dataframe "cafe_df_merged"]
Actually, I want to create a new data frame that shows the 'city_number', 'city' (which is a name), and also the number of cafes in the same city. So, it should have 3 columns: 'city_number', 'city' and 'number_of_cafe'.
However, I tried to use groupby, but the result did not come out as I expected.
city_directory = cafe_df_merged[['city_number', 'city']]
city_directory = city_directory.groupby('city').count()
city_directory
[the result]
How should I do this? Please help, thanks.
There are likely other ways of doing this as well, but something like this should work:
import pandas as pd
import numpy as np
# Create a reproducible example
places = [[['starbucks', 'new_york', '1234']]*5, [['bean_dream', 'boston', '3456']]*4, \
[['coffee_today', 'jersey', '7643']]*3, [['coffee_today', 'DC', '8902']]*3, \
[['starbucks', 'nowwhere', '2674']]*2]
places = [p for sub in places for p in sub]
# a dataframe containing all information
city_directory = pd.DataFrame(places, columns=['shop','city', 'id'])
# make a new dataframe with just cities and ids
# drop duplicate rows
city_info = city_directory.loc[:, ['city','id']].drop_duplicates()
# get the cafe counts (number of cafes)
cafe_count = city_directory.groupby('city').count().iloc[:,0]
# add the cafe counts to the dataframe
city_info['cafe_count'] = cafe_count[city_info['city']].to_numpy()
# reset the index
city_info = city_info.reset_index(drop=True)
city_info now yields the following:
city id cafe_count
0 new_york 1234 5
1 boston 3456 4
2 jersey 7643 3
3 DC 8902 3
4 nowwhere 2674 2
And part of the example dataframe, city_directory.tail(), looks like this:
shop city id
12 coffee_today DC 8902
13 coffee_today DC 8902
14 coffee_today DC 8902
15 starbucks nowwhere 2674
16 starbucks nowwhere 2674
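For reference, a more compact variant of the same idea (a sketch using the example city_directory frame above, not part of the original answer):
# count rows per (city, id) pair in one step
city_info_alt = (
    city_directory.groupby(['city', 'id'])
    .size()
    .reset_index(name='cafe_count')
)
print(city_info_alt)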
Opinion: As a side note, it might be easier to get comfortable with regular Python first before diving deep into the world of pandas and numpy. Otherwise, it might be a bit overwhelming.
I have a wireless radio readout that basically dumps all of the data into one column (column 'A') of a spreadsheet (.xlsx). Is there any way to parse the twenty-plus columns into a dataframe for pandas? This is an example of the data that is in column A of the Excel file:
DSP ALLMSINFO:SECTORID=0,CARRIERID=0;
Belgium351G
+++ HUAWEI 2020-04-03 10:04:47 DST
O&M #4421590
%%/*35687*/DSP ALLMSINFO:SECTORID=0,CARRIERID=0;%%
RETCODE = 0 Operation succeeded
Display Information of All MSs-
------------------------------
Sector ID Carrier ID MSID MSSTATUS MSPWR(dBm) DLCINR(dB) ULCINR(dB) DLRSSI(dBm) ULRSSI(dBm) DLFEC ULFEC DLREPETITIONFATCTOR ULREPETITIONFATCTOR DLMIMOFLAG BENUM NRTPSNUM RTPSNUM ERTPSNUM UGSNUM UL PER for an MS(0.001) NI Value of the Band Where an MS Is Located(dBm) DL Traffic Rate for an MS(byte/s) UL Traffic Rate for an MS(byte/s)
0 0 0011-4D10-FFBA Enter -2 29 27 -56 -107 21 20 0 0 MIMO B 2 0 0 0 0 0 -134 158000 46000
0 0 501F-F63B-FB3B Enter 13 27 28 -68 -107 21 20 0 0 MIMO A 2 0 0 0 0 0 -134 12 8
Basically I just want to parse this data and have the table in a dataframe. Any help would be greatly appreciated.
You could try pandas read_excel:
df = pd.read_excel(filename, skiprows=9)
This assumes we want to ignore the first 9 rows that don't make up the dataframe. Docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
Load the excel file and split the column on the spaces.
A problem may occur with "DLMIMOFLAG" because it has a space in the data and this will cause it to be split over two columns. It's optional whether this is acceptable or if the columns are merged back together afterwards.
Add the header manually rather than load it, otherwise all the spaces in the header will confuse the loading & splitting routines.
import numpy as np
import pandas as pd
# Start on the first data row - row 10
# Make sure pandas knows that only data is being loaded by using
# header=None
df = pd.read_excel('radio.xlsx', skiprows=10, header=None)
This gives a dataframe that is only data, all held in one column.
To split these out, take a reference to the first column with df.iloc[:,0], split the column on whitespace with str.split(), and convert the result to a list of lists with .values.tolist().
Together this looks like:
df2 = pd.DataFrame(df.iloc[:,0].str.split().values.tolist())
Note that the example given has an extra column because of the space in the "DLMIMOFLAG" data ("MIMO A" / "MIMO B"), which causes it to be split over two columns. These will be referred to as "DLMIMOFLAG_A" and "DLMIMOFLAG_B".
Now add on the column headers.
Optionally create a list first.
column_names = ["Sector ID", "Carrier ID", "MSID", "MSSTATUS", "MSPWR(dBm)", "DLCINR(dB)", "ULCINR(dB)",
"DLRSSI(dBm)", "ULRSSI(dBm)", "DLFEC", "ULFEC", "DLREPETITIONFATCTOR", "ULREPETITIONFATCTOR",
"DLMIMOFLAG_A", "DLMIMOFLAG_B", "BENUM", "NRTPSNUM", "RTPSNUM", "ERTPSNUM", "UGSNUM",
"UL PER for an MS(0.001)", "NI Value of the Band Where an MS Is Located(dBm)",
"DL Traffic Rate for an MS(byte/s)", "UL Traffic Rate for an MS(byte/s)",]
df2.columns = column_names
This gives the output as a full dataframe with column headers.
Sector ID Carrier ID MSID MSSTATUS
0 0 0011-4D10-FFBA Enter
0 0 501F-F63B-FB3B Enter
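As an optional follow-up (a sketch, not from the original answer): the two DLMIMOFLAG halves can be merged back together, as mentioned above, and numeric columns converted from strings.
# merge the two DLMIMOFLAG halves back into a single column
df2["DLMIMOFLAG"] = df2["DLMIMOFLAG_A"] + " " + df2["DLMIMOFLAG_B"]
df2 = df2.drop(columns=["DLMIMOFLAG_A", "DLMIMOFLAG_B"])
# convert a few of the clearly numeric columns (names as defined above)
numeric_cols = ["MSPWR(dBm)", "DLCINR(dB)", "ULCINR(dB)", "DLRSSI(dBm)", "ULRSSI(dBm)"]
df2[numeric_cols] = df2[numeric_cols].apply(pd.to_numeric, errors="coerce")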
I'm trying to save some data in a dataframe. The first row of the dataframe should be ('Tom', .99, 'tom2'). Suppose I need to add a ('mart', .3, 'mart2') row to the dataframe; I've tried to use append but it is adding nothing. This is my code:
import pandas as pd
trackeds = {'Name':['Tom'], 'proba':[.99],'name2':['tom2']}
df_trackeds = pd.DataFrame(trackeds)
df_trackeds.append(pd.DataFrame({'name':['mart'],'proba': [.3],'name2':['mart2']}))
print(df_trackeds)
the output is
Name proba name2
0 Tom 0.99 tom2
I also tried to use
df_trackeds.append({'name':['mart'],'proba': [.3],'name2':['mart2']},ignore_index=True)
and
df_trackeds.append(pd.DataFrame({'name':['mart'],'proba': [.3],'name2':['mart2']}))
but nothing changed. I hope you can help me; thanks in advance.
The pandas function DataFrame.append does not work in place like the pure Python list append, so it is necessary to assign the result back:
df = pd.DataFrame({'Name':['mart'],'proba': [.3],'name2':['mart2']})
df_trackeds = df_trackeds.append(df, ignore_index=True)
print(df_trackeds)
Name proba name2
0 Tom 0.99 tom2
1 mart 0.30 mart2
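As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; the same result can be obtained with pd.concat:
df = pd.DataFrame({'Name': ['mart'], 'proba': [.3], 'name2': ['mart2']})
df_trackeds = pd.concat([df_trackeds, df], ignore_index=True)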
My task is to create a friendship matrix (user-user matrix), whose values are 1 if the users are friends and 0 if not. My .csv file has 1.5 million rows, so I created the following little csv to test my algorithm:
user_id friends
Elena Peter, John
Peter Elena, John
John Elena, Peter, Chris
Chris John
For this little csv, my code works well:
%matplotlib inline
import pandas as pd
import seaborn as sns
import numpy as np
from scipy import sparse
sns.set(style="darkgrid")
user_filepath = 'H:\\YelpData\\test.csv' # this is my little test file
df = pd.read_csv(user_filepath, usecols=['user_id','friends'])
def Convert_String_To_List(string):
if string!="None":
li = list(string.split(", "))
else:
li = []
return li
friend_map = {}
for i in range(len(df)): #storing friendships in map
friend_map[df['user_id'][i]] = Convert_String_To_List(df['friends'][i])
users = sorted(friend_map.keys())
user_indices = dict(zip(users, range(len(users)))) #giving indices for users
# and now the sparse matrix:
row_ind = []  # row indices where the value is 1
col_ind = []  # col indices where the value is 1
for user in users:
    for friend in friend_map[user]:
        row_ind.append(user_indices[user])
        col_ind.append(user_indices[friend])
data = [1] * len(row_ind)  # one 1 per stored friendship
mat_coo = sparse.coo_matrix((data, (row_ind, col_ind)))
friend_matrix = mat_coo.toarray() #this friendship matrix is good for the little csv file
But when I try this code on my large (1.5 million row) csv, I get a memory error when storing the friendships in the map (in the for loop).
Is there any solution for this?
I think you are approaching this the wrong way; you should use pandas and vectorized operations as much as possible to account for the large data you have.
This is a complete pandas approach, depending on your data:
import pandas as pd

# assumes df1 is the question's frame with user_id as its index,
# e.g. df1 = df.set_index('user_id')
_series = df1.friends.apply(lambda x: pd.Series(x.split(', '))).unstack().dropna()
data = pd.Series(_series.values, index=_series.index.droplevel(0))
pd.get_dummies(data).groupby('user_id').sum()
Output
Chris Elena John Peter
user_id
Chris 0 0 1 0
Elena 0 0 1 1
John 1 1 0 1
Peter 0 1 1 0
BTW, this can be further optimized: by using pandas you avoid memory-expensive for loops, and you can use chunksize to chunk your data for further optimization.
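A minimal sketch of that chunked reading (assuming the same user_filepath as in the question):
for chunk in pd.read_csv(user_filepath, usecols=['user_id', 'friends'], chunksize=100_000):
    ...  # process each chunk of 100,000 rows instead of loading everything at once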
I think you should not store the strings repeatedly. You need to make a list of names and store the index of each name, not the name itself. This part of the code:
friend_map[df['user_id'][i]] = Convert_String_To_List(df['friends'][i])
can be changed. If you have a list of users,
users = [....] # read from csv
friend_list = Convert_String_To_List(df['friends'][i])
friend_list_idxs = Get_Idx_of_Friends(users,friend_list) #look up table users
friend_map[df['user_id'][i]] = friend_list_idxs
This way, you will not need to store the same string repeatedly. Let's say you have 10 million friend relationships; you would only need on the order of 10 MB of memory.
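A possible sketch of the look-up helper (Get_Idx_of_Friends is the name used above; the implementation here is assumed):
# build the name -> index table once, outside the row loop
user_idx = {name: i for i, name in enumerate(users)}

def Get_Idx_of_Friends(users, friend_list):
    # map each friend name to its integer index instead of storing the strings
    return [user_idx[name] for name in friend_list]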