Python: create edge list for bipartite graph using pandas - python

I have a simple file that lists texts by name and then words that are a part of that text:
text,words
ANC088,woods dig spirit controller father treasure_lost
ANC089,controller dig spirit
ANC090,woods ag_work tomb
ANC091,well spirit_seen treasure
Working with pandas, I have this admittedly kludgy solution for getting a list of nodes for the two sides of a bipartite graph, one side listing the texts and the other the words associated with those texts:
import pandas as pd
df = pd.read_csv(open('tales-02.txt', 'r'))
node_list_0 = df['text'].values.tolist()
# a list comprehension (rather than filter) so the result is a list on Python 3 as well
node_list_1 = [w for w in sorted(set(' '.join(df['words'].values.tolist()).split(' '))) if w]
It ain't pretty, but it works, and it's fast enough for my small data set.
What I need is a list of edges between those two sets of nodes. I can do this with csv, but I can't figure out how to do it in pandas. Here's my working csv version:
import csv
edge_list = []
# on Python 3 the csv module expects a text-mode file opened with newline=''
texts = csv.reader(open('tales-01.txt', newline=''), delimiter=',', skipinitialspace=True)
for row in texts:
    for item in row[1:]:
        edge_list.append((row[0], item))
I should note that this version of the input is csv all the way:
ANC088,woods,dig,spirit,controller,father,treasure_lost
ANC089,controller,dig,spirit
I adjusted the file format to make it easier for me to write the pandas stuff -- if someone can also show me how to grab the node lists out of the pure csv file, that would be awesome.
I'd rather this be done either all csv or all pandas. I tried to write a script that would get me the node lists using csv but I kept getting an empty list. That's when I turned to pandas, which everyone tells me I should be using anyway.

The following code creates a DataFrame with the text and word columns from your file tales-01.txt. It's not very pretty (is there a prettier solution?), but it seems to do the job.
import pandas as pd
df = (pd.read_csv('tales-01.txt', header=None)
      .groupby(level=0).apply(
          lambda x: pd.DataFrame([[x.iloc[0, 0], v]
                                  for v in x.iloc[0, 1:]]))
      .reset_index(drop=True)
      .dropna()
      # rename(columns=...) relabels the columns; current pandas no longer accepts a dict in rename_axis for this
      .rename(columns={0: 'text', 1: 'word'})
      )
Here is a second solution based on the same idea that uses zip instead of the for loop. It might be faster.
def my_zip(d):
    t, w = d.iloc[0, 0], d.iloc[0, 1:]
    # list(...) so the pairs are materialised before building the frame
    return pd.DataFrame(list(zip([t] * len(w), w))).dropna()
df = (pd.read_csv('tales-01.txt', header=None)
      .groupby(level=0)
      .apply(my_zip)
      .reset_index(drop=True)
      .rename(columns={0: 'text', 1: 'word'})
      )
The result is the same in both cases:
text word
0 ANC088 woods
1 ANC088 dig
2 ANC088 spirit
3 ANC088 controller
4 ANC088 father
5 ANC088 treasure_lost
6 ANC089 controller
7 ANC089 dig
8 ANC089 spirit
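For the space-separated version of the file (the tales-02.txt layout from the question), a shorter route may be to split the words column and let pandas expand it. This is only a sketch: it assumes a pandas version that has DataFrame.explode (0.25 or later), and it uses an inline copy of two sample rows instead of reading the real file.

```python
import io
import pandas as pd

# Inline copy of two rows in the tales-02.txt layout (text, space-separated words).
raw = io.StringIO(
    "text,words\n"
    "ANC088,woods dig spirit controller father treasure_lost\n"
    "ANC089,controller dig spirit\n"
)
df = pd.read_csv(raw)

# Split each words string into a list, then give every word its own row.
edges = (df.assign(word=df["words"].str.split())
           .explode("word")[["text", "word"]]
           .reset_index(drop=True))

# Edge list as plain tuples, plus the two node lists.
edge_list = list(edges.itertuples(index=False, name=None))
node_list_0 = df["text"].tolist()
node_list_1 = sorted(edges["word"].unique())
```

The same edges frame gives both node lists, so nothing has to be joined back into one big string first.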

Related

How to split csv data

I have a problem where I got a csv data like this:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth; Supermarket Product 0-1 years
36-50 Social Media; Word of mouth 1-2 years
18-24 Advertisement +4 years
and I tried to make the file into this format through either jupyter notebook or from excel csv:
AgeGroup Where do you hear our company from? How long have you using our platform?
18-24 Word of mouth 0-1 years
18-24 Supermarket Product 0-1 years
36-50 Social Media 1-2 years
36-50 Word of mouth 1-2 years
18-24 Advertisement +4 years
Let's say the csv file is Untitled form.csv and I import the data to jupyter notebook:
data = pd.read_csv('Untitled form.csv')
Can anyone tell me how I should do it?
I have tried doing it in Excel using data-column, but of course that only separates the data into columns, whereas what I want is for the data to be separated into rows while still retaining the data from the other columns.
Anyway, I found another, more roundabout way to do it. First I edit the file through PowerSource excel and save it to a different file, and then, if a utf-8 encoding error appears, I just add encoding='cp1252'.
So it would become like this:
import pandas as pd
data_split = pd.read_csv('Untitled form split.csv',
                         skipinitialspace=True,
                         usecols=range(1,7),
                         encoding='cp1252')
However, if there's a more efficient way, please let me know. Thanks
I'm not 100% sure about your question since I think it might be two separate issues but hopefully this should fix it.
import pandas as pd
data = pd.read_fwf('Untitled form.csv')
cols = data.columns
data_long = pd.DataFrame(columns=data.columns)
for idx, row in data.iterrows():
    # split the multi-valued answer on ';' and trim the surrounding spaces
    hear_from = row['Where do you hear our company from?'].split(';')
    hear_from_fmt = list(map(lambda x: x.strip(), hear_from))
    n_items = len(hear_from_fmt)
    # repeat the first and last columns once per split value
    d = {
        cols[0] : [row.iloc[0]]*n_items,
        cols[1] : hear_from_fmt,
        cols[2] : [row.iloc[2]]*n_items,
    }
    data_long = pd.concat([data_long, pd.DataFrame(d)], ignore_index=True)
Let's break it down.
The line data = pd.read_fwf('Untitled form.csv') reads the file by inferring the spacing between columns. This is only useful because I am not sure your file is a proper CSV; if it is, you can open it normally with pd.read_csv, and if not, this might help.
Now for the rest. We iterate through each row and select the ways someone could have heard of your company. These are split on ; and then "stripped" to remove the surrounding spaces. A new temporary dataframe is created in which the first and last columns are repeated, with as many rows as there are elements in the hear_from_fmt list. The dataframes are then concatenated together.
Now there might be a more efficient solution, but this should work.
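One candidate for a more compact version is to split the column into lists and call DataFrame.explode, which repeats the other columns for each list element. The sketch below is not taken from the answer above; it assumes pandas 0.25 or later and uses a small hand-typed frame with the question's column names instead of reading 'Untitled form.csv'.

```python
import pandas as pd

# Small stand-in for the question's data; the real file would be loaded with pd.read_csv.
data = pd.DataFrame({
    "AgeGroup": ["18-24", "36-50", "18-24"],
    "Where do you hear our company from?": [
        "Word of mouth; Supermarket Product",
        "Social Media; Word of mouth",
        "Advertisement",
    ],
    "How long have you using our platform?": ["0-1 years", "1-2 years", "+4 years"],
})

col = "Where do you hear our company from?"
data_long = (
    data.assign(**{col: data[col].str.split(";")})  # string -> list of answers
        .explode(col)                                # one row per answer
        .reset_index(drop=True)
)
data_long[col] = data_long[col].str.strip()          # drop the space left after ';'
```

This avoids growing a dataframe inside a Python loop, which tends to get slow as the number of rows increases.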

Can I loop the same analysis across multiple csv dataframes then concatenate results from each into one table?

newbie python learner here!
I have 20 participant csv files (P01.csv to P20.csv) with dataframes in them that contain stroop test data. The important columns for each are the condition column which has a random mix of incongruent and congruent conditions, the reaction time column for each condition and the column for if the response was correct, true or false.
Here is an example of the dataframe for P01 I'm not sure if this counts as a code snippet? :
trialnum,colourtext,colourname,condition,response,rt,correct
1,blue,red,incongruent,red,0.767041,True
2,yellow,yellow,congruent,yellow,0.647259,True
3,green,blue,incongruent,blue,0.990185,True
4,green,green,congruent,green,0.720116,True
5,yellow,yellow,congruent,yellow,0.562909,True
6,yellow,yellow,congruent,yellow,0.538918,True
7,green,yellow,incongruent,yellow,0.693017,True
8,yellow,red,incongruent,red,0.679368,True
9,yellow,blue,incongruent,blue,0.951432,True
10,blue,blue,congruent,blue,0.633367,True
11,blue,green,incongruent,green,1.289047,True
12,green,green,congruent,green,0.668142,True
13,blue,red,incongruent,red,0.647722,True
14,red,blue,incongruent,blue,0.858307,True
15,red,red,congruent,red,1.820112,True
16,blue,green,incongruent,green,1.118404,True
17,red,red,congruent,red,0.798532,True
18,red,red,congruent,red,0.470939,True
19,red,blue,incongruent,blue,1.142712,True
20,red,red,congruent,red,0.656328,True
21,red,yellow,incongruent,yellow,0.978830,True
22,green,red,incongruent,red,1.316182,True
23,yellow,yellow,congruent,green,0.964292,False
24,green,green,congruent,green,0.683949,True
25,yellow,green,incongruent,green,0.583939,True
26,green,blue,incongruent,blue,1.474140,True
27,green,blue,incongruent,blue,0.569109,True
28,green,green,congruent,blue,1.196470,False
29,red,red,congruent,red,4.027546,True
30,blue,blue,congruent,blue,0.833177,True
31,red,red,congruent,red,1.019672,True
32,green,blue,incongruent,blue,0.879507,True
33,red,red,congruent,red,0.579254,True
34,red,blue,incongruent,blue,1.070518,True
35,blue,yellow,incongruent,yellow,0.723852,True
36,yellow,green,incongruent,green,0.978838,True
37,blue,blue,congruent,blue,1.038232,True
38,yellow,green,incongruent,yellow,1.366425,False
39,green,red,incongruent,red,1.066038,True
40,blue,red,incongruent,red,0.693698,True
41,red,blue,incongruent,blue,1.751062,True
42,blue,blue,congruent,blue,0.449651,True
43,green,red,incongruent,red,1.082267,True
44,blue,blue,congruent,blue,0.551023,True
45,red,blue,incongruent,blue,1.012258,True
46,yellow,green,incongruent,yellow,0.801443,False
47,blue,blue,congruent,blue,0.664119,True
48,red,green,incongruent,yellow,0.716189,False
49,green,green,congruent,yellow,0.630552,False
50,green,yellow,incongruent,yellow,0.721917,True
51,red,red,congruent,red,1.153943,True
52,blue,red,incongruent,red,0.571019,True
53,yellow,yellow,congruent,yellow,0.651611,True
54,blue,blue,congruent,blue,1.321344,True
55,green,green,congruent,green,1.159240,True
56,blue,blue,congruent,blue,0.861646,True
57,yellow,red,incongruent,red,0.793069,True
58,yellow,yellow,congruent,yellow,0.673190,True
59,yellow,red,incongruent,red,1.049320,True
60,red,yellow,incongruent,yellow,0.773447,True
61,red,yellow,incongruent,yellow,0.693554,True
62,red,red,congruent,red,0.933901,True
63,blue,blue,congruent,blue,0.726794,True
64,green,green,congruent,green,1.046116,True
65,blue,blue,congruent,blue,0.713565,True
66,blue,blue,congruent,blue,0.494177,True
67,green,green,congruent,green,0.626399,True
68,blue,blue,congruent,blue,0.711896,True
69,blue,blue,congruent,blue,0.460420,True
70,green,green,congruent,yellow,1.711978,False
71,blue,blue,congruent,blue,0.634218,True
72,yellow,blue,incongruent,yellow,0.632482,False
73,yellow,yellow,congruent,yellow,0.653813,True
74,green,green,congruent,green,0.808987,True
75,blue,blue,congruent,blue,0.647117,True
76,green,red,incongruent,red,1.791693,True
77,red,yellow,incongruent,yellow,1.482570,True
78,red,red,congruent,red,0.693132,True
79,red,yellow,incongruent,yellow,0.815830,True
80,green,green,congruent,green,0.614441,True
81,yellow,red,incongruent,red,1.080385,True
82,red,green,incongruent,green,1.198548,True
83,blue,green,incongruent,green,0.845769,True
84,yellow,blue,incongruent,blue,1.007089,True
85,green,blue,incongruent,blue,0.488701,True
86,green,green,congruent,yellow,1.858272,False
87,yellow,yellow,congruent,yellow,0.893149,True
88,yellow,yellow,congruent,yellow,0.569597,True
89,yellow,yellow,congruent,yellow,0.483542,True
90,yellow,red,incongruent,red,1.669842,True
91,blue,green,incongruent,green,1.158416,True
92,blue,red,incongruent,red,1.853055,True
93,green,yellow,incongruent,yellow,1.023785,True
94,yellow,blue,incongruent,blue,0.955395,True
95,yellow,yellow,congruent,yellow,1.303260,True
96,blue,yellow,incongruent,yellow,0.737741,True
97,yellow,green,incongruent,green,0.730972,True
98,green,red,incongruent,red,1.564596,True
99,yellow,yellow,congruent,yellow,0.978911,True
100,blue,yellow,incongruent,yellow,0.508151,True
101,red,green,incongruent,green,1.821969,True
102,red,red,congruent,red,0.818726,True
103,yellow,yellow,congruent,yellow,1.268222,True
104,yellow,yellow,congruent,yellow,0.585495,True
105,green,green,congruent,green,0.673404,True
106,blue,yellow,incongruent,yellow,1.407036,True
107,red,red,congruent,red,0.701050,True
108,red,green,incongruent,red,0.402334,False
109,red,green,incongruent,green,1.537681,True
110,green,yellow,incongruent,yellow,0.675118,True
111,green,green,congruent,green,1.004550,True
112,yellow,blue,incongruent,blue,0.627439,True
113,yellow,yellow,congruent,yellow,1.150248,True
114,blue,yellow,incongruent,yellow,0.774452,True
115,red,red,congruent,red,0.860966,True
116,red,red,congruent,red,0.499595,True
117,green,green,congruent,green,1.059725,True
118,red,red,congruent,red,0.593180,True
119,green,yellow,incongruent,yellow,0.855915,True
120,blue,green,incongruent,green,1.335018,True
But I am only interested in the 'condition', 'rt', and 'correct' columns.
I need to create a table that says the mean reaction time for the congruent conditions, and the incongruent conditions, and the percentage correct for each condition. But I want to create an overall table of these results for each participant. I am aiming to get something like this as an output table:
Participant  Stimulus Type  Mean Reaction Time  Percentage Correct
01           Congruent      0.560966            80
01           Incongruent    0.890556            64
02           Congruent      0.460576            89
02           Incongruent    0.956556            55
Etc. for all 20 participants. This was just an example of my ideal output because later I'd like to plot a graph of the means from each condition across the participants. But if anyone thinks that table does not make sense or is inefficient, I'm open to any advice!
I want to use pandas but don't know where to begin finding the rt means for each condition when there are two different conditions in the same column in each dataframe? And I'm assuming I need to do it in some kind of loop that can run over each participant csv file, and then concatenates the results in a table for all the participants?
Initially, after struggling to figure out the loop I would need and looking on the web, I ran the code below, which concatenates the dataframes of all the participants. I hoped this would let me run the same analysis on all of them at once, but the problem is that the combined table doesn't identify which participant each row came from (there are 120 rows per participant, like the example above):
import os
import glob
import pandas as pd
#set working directory
os.chdir('data')
#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')
Perhaps I could do something to add a participant column to identify each participant's data set in the concatenated table and then perform the mean and percentage correct analysis on the two conditions for each participant in that big concatenated table?
Or would it be better to do the analysis and then loop it over all of the individual participant csv files of dataframes?
I'm sorry if this is a really obvious process, I'm new to python and trying to learn to analyse my data more efficiently, have been scouring the Internet and Panda tutorials but I'm stuck. Any help is welcome! I've also never used Stackoverflow before so sorry if I haven't formatted things correctly here but thanks for the feedback about including examples of the input data, code I've tried, and desired output data, I really appreciate the help.
Try this:
from pathlib import Path
import pandas as pd
# Use the Path class to represent a path. It offers more
# functionality when performing operations on paths
path = Path("./data").resolve()
# Create a dictionary whose keys are the Participant ID
# (the `01` in `P01.csv`, etc), and whose values are
# the data frames initialized from the CSV
data = {
    p.stem[1:]: pd.read_csv(p) for p in path.glob("*.csv")
}
# Create a master data frame by combining the individual
# data frames from each CSV file
df = pd.concat(data, keys=data.keys(), names=["participant", None])
# Calculate the statistics
result = (
    df.groupby(["participant", "condition"]).agg(**{
        "Mean Reaction Time": ("rt", "mean"),
        "correct": ("correct", "sum"),
        "size": ("trialnum", "size")
    }).assign(**{
        # multiply by 100 so the value is a percentage rather than a fraction
        "Percentage Correct": lambda x: x["correct"] / x["size"] * 100
    }).drop(columns=["correct", "size"])
    .reset_index()
)
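To see the idea without any files on disk, here is a small self-contained sketch. The two frames and their values are made up purely for illustration (they are not the real P01/P02 data), and the output column names mean_rt and pct_correct are my own. It relies on the fact that the mean of a True/False column is the fraction of True values.

```python
import pandas as pd

# Tiny made-up frames standing in for P01.csv and P02.csv (illustrative values only).
frames = {
    "01": pd.DataFrame({
        "condition": ["congruent", "congruent", "incongruent", "incongruent"],
        "rt": [0.5, 0.7, 0.9, 1.1],
        "correct": [True, True, True, False],
    }),
    "02": pd.DataFrame({
        "condition": ["congruent", "congruent", "incongruent", "incongruent"],
        "rt": [0.4, 0.6, 1.0, 1.2],
        "correct": [True, False, True, True],
    }),
}

# The dict keys become the outer index level, so every row is labelled with its participant.
combined = pd.concat(frames, names=["participant", "row"])

summary = (
    combined.groupby(["participant", "condition"])
            .agg(mean_rt=("rt", "mean"), pct_correct=("correct", "mean"))
            .assign(pct_correct=lambda t: t["pct_correct"] * 100)  # fraction -> percent
            .reset_index()
)
print(summary)
```

The resulting long table (one row per participant and condition) is also a convenient shape for plotting the condition means across participants later.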

ways to improve efficiency of Python script

I have a list of genes, their coordinates, and their expression (right now just looking at the top 500 most highly expressed genes) and 12 files corresponding to DNA reads. I have a python script that searches for reads overlapping with each gene's coordinates and storing the values in a dictionary. I then use this dictionary to create a Pandas dataframe and save this as a csv. (I will be using these to create a scatterplot.)
The RNA file looks like this (the headers are gene name, chromosome, start, stop, gene coverage/enrichment):
MSTRG.38 NC_008066.1 9204 9987 48395.347656
MSTRG.36 NC_008066.1 7582 8265 47979.933594
MSTRG.33 NC_008066.1 5899 7437 43807.781250
MSTRG.49 NC_008066.1 14732 15872 26669.763672
MSTRG.38 NC_008066.1 8363 9203 19514.273438
MSTRG.34 NC_008066.1 7439 7510 16855.662109
And the DNA file looks like this (the headers are chromosome, start, stop, gene name, coverage, strand):
JQ673480.1 697 778 SRX6359746.5505370/2 8 +
JQ673480.1 744 824 SRX6359746.5505370/1 8 -
JQ673480.1 1712 1791 SRX6359746.2565519/2 27 +
JQ673480.1 3445 3525 SRX6359746.7028440/2 23 -
JQ673480.1 4815 4873 SRX6359746.6742605/2 37 +
JQ673480.1 5055 5092 SRX6359746.5420114/2 40 -
JQ673480.1 5108 5187 SRX6359746.2349349/2 24 -
JQ673480.1 7139 7219 SRX6359746.3831446/2 22 +
The RNA file has >9,000 lines, and the DNA files have > 12,000,000 lines.
I originally had a for-loop that would generate a dictionary containing all values for all 12 files in one go, but it runs extremely slowly. Since I have access to a computing system with multiple cores, I've decided to run a script that only calculates coverage one DNA file at a time, like so:
#import modules
import csv
import pandas as pd
import matplotlib.pyplot as plt
#set sample name
sample='CON-2'
#set fraction number
f=6
#dictionary to store values
d={}
#load file name into variable
fileRNA="top500_R8_7-{}-RNA.gtf".format(sample)
print(fileRNA)
#read tsv file
tsvRNA = open(fileRNA)
readRNA = csv.reader(tsvRNA, delimiter="\t")
expGenes=[]
#convert tsv file into Python list
for row in readRNA:
    gene=row[0],row[1],row[2],row[3],row[4]
    expGenes.append(row)
#print(expGenes)
#establish file name for DNA reads
fileDNA="D2_7-{}-{}.bed".format(sample,f)
print(fileDNA)
tsvDNA = open(fileDNA)
readDNA = csv.reader(tsvDNA, delimiter="\t")
#put file into Python list
MCNgenes=[]
for row in readDNA:
    read=row[0],row[1],row[2]
    MCNgenes.append(read)
#find read counts
for r in expGenes:
    #include FPKM in the dictionary
    d[r[0]]=[r[4]]
    regionCount=0
    #set start and stop points based on transcript file
    chr=r[1]
    start=int(r[2])
    stop=int(r[3])
    #print("start:",start,"stop:",stop)
    for row in MCNgenes:
        if start < int(row[1]) < stop:
            regionCount+=1
    d[r[0]].append(regionCount)
df=pd.DataFrame.from_dict(d)
#save as csv for the heatmap
df.to_csv("7-CON-2-6_forHeatmap.csv")
This script also runs quite slowly, however. Are there any changes I can make to get it run more efficiently?
If I understood correctly and you are trying to match the coordinates of genes across different files, I believe the best option would be to use something like the KDTree partitioning algorithm.
You can use a KDTree to partition your DNA and RNA data. I'm assuming you're using 'start' and 'stop' as 'coordinates':
import pandas as pd
import numpy as np
from sklearn.neighbors import KDTree
dna = pd.DataFrame() # this is your dataframe with DNA data
rna = pd.DataFrame() # Same for RNA
# Let's assume you are using 'start' and 'stop' columns as coordinates
dna_coord = dna.loc[:, ['start', 'stop']]
rna_coord = rna.loc[:, ['start', 'stop']]
dna_kd = KDTree(dna_coord)
rna_kd = KDTree(rna_coord)
# Now you can go through your data and match with DNA:
my_data = pd.DataFrame() # placeholder: the rows you want to look up
for start, stop in zip(my_data.start, my_data.stop):
    # query expects a 2-D array of shape (n_queries, n_features)
    coord = np.array([[start, stop]])
    dist, idx = dna_kd.query(coord, k=1)
    # Assuming you need an exact match
    if np.isclose(dist[0, 0], 0):
        # Now that you have the index of the matching row in DNA data
        # you can extract information using the index and do whatever
        # you want with it (the tree returns positions, hence iloc)
        dna_gene_data = dna.iloc[idx[0, 0], :]
You can adjust your search parameters to get the desired results, but this will be much faster than searching every time.
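If the goal is the count from the question (a read counts for a gene when its start lies strictly between the gene's start and stop), a different technique from the tree lookup may be enough: sort the read starts once and binary-search both gene boundaries with numpy.searchsorted. The sketch below uses made-up coordinates, and like the question's loop it ignores the chromosome column, so reads would need to be grouped by chromosome first if coordinates repeat across chromosomes.

```python
import numpy as np

# Made-up coordinates for illustration; in practice these would come from the
# start/stop columns of the RNA file and the start column of one DNA file.
gene_start = np.array([100, 500, 900])
gene_stop = np.array([200, 650, 950])
read_start = np.array([50, 120, 150, 200, 510, 649, 650, 1000])

# Sort the read starts once, then binary-search both gene boundaries.
sorted_reads = np.sort(read_start)
# Number of reads with start <= gene_start (excluded by the strict '<').
lo = np.searchsorted(sorted_reads, gene_start, side="right")
# Number of reads with start < gene_stop.
hi = np.searchsorted(sorted_reads, gene_stop, side="left")
# Reads with gene_start < read_start < gene_stop, one count per gene.
counts = hi - lo
print(counts)
```

Each gene then costs two binary searches instead of a full pass over the reads, so the work grows roughly with (reads + genes) times log(reads) rather than with reads times genes.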
Generally, Python is extremely easy to work with, at the cost of being inefficient! Scientific libraries (such as pandas and numpy) help here by paying the Python overhead only a small number of times to map the work into a convenient space, then doing the "heavy lifting" in a more efficient language (which may be quite painful/inconvenient to work with directly).
General advice
try to get data into a dataframe whenever possible and keep it there (do not convert data into some intermediate Python object like a list or dict)
try to use methods of the dataframe or parts of it to do work (such as .apply() and .map()-like methods)
whenever you must iterate in native Python, iterate on the shorter side of a dataframe (ie. there may be only 10 columns, but 10,000 rows ; go over the columns)
More on this topic here:
How to iterate over rows in a DataFrame in Pandas?
Answer: DON'T*!
Once you have a program, you can benchmark it by collecting runtime information. There are many libraries for this, but there is also a builtin one called cProfile which may work for you.
docs: https://docs.python.org/3/library/profile.html
python3 -m cProfile -o profile.out myscript.py
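A saved profile can be inspected with the standard-library pstats module. The sketch below is self-contained so it can be run as-is: it profiles a deliberately slow toy function (slow_sum is made up for the example), writes the stats to a temporary file in the same format that the -o option produces, and prints the five most expensive calls by cumulative time.

```python
import cProfile
import io
import pstats
import tempfile
from pathlib import Path


def slow_sum(n):
    """Deliberately naive loop so that it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Profile one call and write the stats to a file, as `-o profile.out` would.
profiler = cProfile.Profile()
profiler.enable()
slow_sum(200_000)
profiler.disable()

out_path = Path(tempfile.mkdtemp()) / "profile.out"
profiler.dump_stats(str(out_path))

# Load the saved stats and list the five most expensive calls by cumulative time.
report = io.StringIO()
stats = pstats.Stats(str(out_path), stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

For a profile.out written by the command above, only the last four lines are needed, with the path changed to point at that file.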

Sorting in pandas by multiple column without distorting index

I have just started out with Pandas and I am trying to do a multilevel sort of data by columns. I have four columns in my data: STNAME, CTYNAME, CENSUS2010POP, SUMLEV. I want to set the index of my data to the columns STNAME and CTYNAME and then sort the data by CENSUS2010POP. After I set the index, the data appears as in pic 1 (before sorting by CENSUS2010POP), and after I sort, the data appears as in pic 2 (after sorting). You can see the indices are messy and no longer sorted serially.
I have read a few posts, including this one (Sorting a multi-index while respecting its index structure), which dates back five years and does not work when I try it. I have yet to learn the groupby function.
Could you please tell me a way I can achieve this?
ps: I come from an accounting/finance background and am very new to coding. I have just completed two Python courses, including PY4E.com
I used the code below to set the index:
census_dfq6 = census_dfq6.set_index(['STNAME','CTYNAME'])
and used the code below to sort the data:
census_dfq6 = census_dfq6.sort_values(by=['CENSUS2010POP'], ascending=[False])
Sample data I am working with (I would love to share the csv file but I don't see a way to share it):
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
Required End Result:
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50

Python Sorting and Organising

I'm trying to sort data from a file and not quite getting what I need. I have a text file with race details (name and placement, i.e. 1, 2, 3). I would like to be able to organize the data by highest placement first and also alphabetically by name. I can do this if I split the lines, but then the name and score will not match up.
Any help and suggestions would be very welcome; I've hit that proverbial wall.
My apologies (first-time user of this site, and Python noob, steep learning curve). Thank you for your suggestions, I really do appreciate the help.
comp=[]
results = open('d:\\test.txt', 'r')
for line in results:
    line=line.split()
    # (name,score)= line.split()
    comp.append(line)
sorted(comp)
results.close()
print (comp)
Test file was in this format:
Jones 2
Ranfel 7
Peterson 5
Smith 1
Simons 9
Roberts 4
McDonald 3
Rogers 6
Elliks 8
Helm 10
I completely agree with everyone who has down-voted this question for being badly posed. However, I'm in a good mood so I'll try and at least steer you in the right direction:
Let's assume your text file looks like this:
Name,Placement
D,1
D,2
C,1
C,3
B,1
B,3
A,1
A,4
I suggest importing the data and sorting it using Pandas http://pandas.pydata.org/
import pandas as pd
# Read in the data
# Replace <FULL_PATH_OF_FILE> with something like C:/Data/RaceDetails.csv
# The first row is automatically used for column names
data=pd.read_csv("<FULL_PATH_OF_FILE>")
# Sort the data (sort_values replaces the old DataFrame.sort, which current pandas no longer has)
sorted_data=data.sort_values(['Placement','Name'])
# Create a re-indexed data frame if you so desire
sorted_data_new_index=sorted_data.reset_index(drop=True)
This gives me:
Name Placement
A 1
B 1
C 1
D 1
D 2
B 3
C 3
A 4
I'll leave you to figure out the rest..
As @Jack said, I am very limited in how I can help if you don't post code or the txt file. However, I've run into a similar problem before, so I know the basics (again, I will need code/files before I can give an exact type-this-stuff answer!)
You can either develop an algorithm yourself, or use the built-in sorted feature
Put the names and scores in a list (or dictionary) such as:
name_scores = [['Matt', 95], ['Bob', 50], ['Ashley', 100]]
and then call sorted(name_scores) and it will sort by names: [['Ashley', 100], ['Bob', 50], ['Matt', 95]]
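To sort by placement instead of name, sorted accepts a key function. The sketch below uses a few rows in the layout of the question's test file; it assumes that "highest placement" means 1st place, and it converts the placement to an integer so that 10 sorts after 9 rather than after 1.

```python
# Rows in the same "name placement" layout as the question's test file.
lines = ["Jones 2", "Ranfel 7", "Peterson 5", "Smith 1", "Helm 10"]

# Keep each name together with its placement, converting the placement to int.
comp = []
for line in lines:
    name, placement = line.split()
    comp.append((name, int(placement)))

by_placement = sorted(comp, key=lambda pair: pair[1])  # 1st place first
by_name = sorted(comp)                                 # alphabetical by name

print(by_placement)
print(by_name)
```

Because each name and placement travel together in one tuple, they stay matched whichever way the list is sorted. Note that sorted returns a new list and leaves comp unchanged, so the result has to be assigned to a variable.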
