I need to sort a .csv file in a very specific way, but I have pretty limited knowledge of Python. I have some code that works, but it doesn't do exactly what I want it to. The format is as follows:
{header} {header} {header} {header}
{dataA} {dataB} {dataC} {dataD}
In the CSV, whatever dataA is, it is usually repeated 100-200 times. Is there a way I can take a dataA value (e.g. examplecompany), report how many times it repeats, and then report how many times each dataC value appears in rows that have that dataA as the first item? For example, the output might be: examplecompany appeared 100 times; of those 100, datac1 appeared 45 times and datac2 appeared 55 times. I'm really terrible at explaining things, so any help would be appreciated.
You can use csv.DictReader to read the file and then sort by the key you want.
from csv import DictReader

with open("test.csv") as f:
    reader = DictReader(f)
    # sort the rows by the value in column1
    sorted_rows = sorted(reader, key=lambda x: x["column1"])
CSV file I tested it with (test.csv):
column1,column2
2,bla
1,blubb
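If you also need the counts from the question rather than just a sort, collections.Counter handles it with the same csv module (a sketch; it assumes the real file's headers are named headerA through headerD, unlike the two-column test file above, and "data.csv" is a hypothetical filename):
from collections import Counter
from csv import DictReader

with open("data.csv") as f:  # hypothetical filename for the real file
    rows = list(DictReader(f))

# how many times each dataA value appears
company_counts = Counter(r["headerA"] for r in rows)
# how many times each dataC value appears alongside each dataA value
pair_counts = Counter((r["headerA"], r["headerC"]) for r in rows)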
It is not clear what you want to accomplish, since you have not provided any code or a complete example of input/output for your problem.
To me, it seems that you want to count the occurrences of each value in headerC for each unique value in headerA.
Suppose you have the following .csv file:
headerA,headerB,headerC,headerD
examplecompany1,datab,datac1,datad
examplecompany2,datab,datac2,datad
examplecompany2,datab,datac1,datad
examplecompany1,datab,datac2,datad
examplecompany1,datab,datac1,datad
examplecompany2,datab,datac2,datad
examplecompany1,datab,datac1,datad
examplecompany1,datab,datac2,datad
examplecompany1,datab,datac3,datad
You can accomplish this counting with pandas. Following is an example of how you might do it.
>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df.groupby(['headerA'])['headerC'].value_counts()
headerA headerC
examplecompany1 datac1 3
datac2 2
datac3 1
examplecompany2 datac2 2
datac1 1
Name: headerC, dtype: int64
Here, groupby groups the DataFrame using headerA as a reference (you can group by a single Series or by a list of Series). The square-bracket notation then selects the headerC column, and value_counts counts each occurrence of headerC within the groups formed by headerA. Afterwards you can format the output however you want.
Edit:
I forgot that you also wanted the number of occurrences of headerA, but that is really simple: select the headerA column of the DataFrame df and call value_counts on it.
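For the sample file above, that gives:
>>> df['headerA'].value_counts()
examplecompany1    6
examplecompany2    3
Name: headerA, dtype: int64
And to produce the sentence-style summary from the question, you might combine the two counts like this (a sketch reusing only the objects computed above):
counts = df.groupby('headerA')['headerC'].value_counts()
totals = df['headerA'].value_counts()
for company, total in totals.items():
    parts = ', '.join('{} appeared {} times'.format(c, n)
                      for c, n in counts[company].items())
    print('{} appeared {} times; of those, {}'.format(company, total, parts))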
Related
I am working on Python code to calculate the occurrences of a few values in a column within a CSV.
Example - CSV1 is as below:
Type Value
simple test
complex problem
simple formula
complex theory
simple idea
simple task
I need to get the contents of Value for the types simple and complex, i.e.:
Type Value
simple test
simple formula
simple idea
simple task
complex theory
complex problem
And then query the other CSV, CSV2, for the total count of occurrences of the simple list, i.e. [test, formula, idea, task], and the complex list, i.e. [theory, problem].
The other CSV, CSV2, is:
Category
test
test
test
formula
formula
formula
test
test
idea
task
task
idea
task
idea
task
problem
problem
theory
problem
problem
idea
task
problem
test
Both CSV1 and CSV2 are dynamic. From CSV1, for example, for the type "simple", get the list of corresponding values, then refer to CSV2 to find the count for each of those values, i.e. the counts of test, idea, task, and formula.
The same goes for the complex type.
I tried multiple methods with pandas but am not getting the result I expect. Any pointers, please?
Use:
df2['cat'] = df2['Category'].map(df1.set_index('Value')['Type'])  # look up each Category's Type in CSV1
df2 = df2['cat'].value_counts().rename_axis('a').reset_index(name='b')
print(df2)
a b
0 simple 18
1 complex 6
Much like #jezrael; however, I would first group the second csv, which helps the merge if the second csv is very large (note the column is called Category in CSV2 and Value in CSV1, so the merge keys differ):
df2 = cv2.groupby('Category').agg(cnt=('Category', 'count')).reset_index()
This gives a dataframe with two columns: Category and its count, cnt.
Now you can merge it with CV1:
df1 = cv1.merge(df2, left_on='Value', right_on='Category', how='inner')
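From there, summing cnt per Type recovers the same totals as the first answer (a sketch, assuming the merged frame df1 from the line above):
df1.groupby('Type')['cnt'].sum()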
I have posted a similar question before, but I thought it'd be better to elaborate on it in another way. For example, I have a dataframe of compounds, each assigned to a number, as follows:
compound,number
17alpha_beta_SID_24898755,8
2_prolinal_109328,3
4_chloro_4491,37
5HT_144234_01,87
5HT_144234_02,2
6-OHDA_153466,23
There is also another dataframe with other properties as well as the compound names, but here the same compound name appears in several rows assigned to different numbers; the rows where the numbers differ are not of interest:
rmsd,chemplp,plp,compound,number
1.00,14.00,-25.00,17alpha_beta_SID_24898755,7
0.38,12.00,-19.00,17alpha_beta_SID_24898755,8
0.66,16.00,-25.6,17alpha_beta_SID_24898755,9
0.87,24.58,-38.35,2_prolinal_109328,3
0.17,54.58,-39.32,2_prolinal_109328,4
0.22,22.58,-32.35,2_prolinal_109328,5
0.41,45.32,-37.90,4_chloro_4491,37
0.11,15.32,-37.10,4_chloro_4491,38
0.11,15.32,-17.90,4_chloro_4491,39
0.61,38.10,-45.86,5HT_144234_01,85
0.62,18.10,-15.86,5HT_144234_01,86
0.64,28.10,-25.86,5HT_144234_01,87
0.64,16.81,-10.87,5HT_144234_02,2
0.14,16.11,-10.17,5HT_144234_02,3
0.14,16.21,-10.17,5HT_144234_02,4
0.15,31.85,-24.23,6-OHDA_153466,23
0.13,21.85,-34.23,6-OHDA_153466,24
0.11,11.85,-54.23,6-OHDA_153466,25
The problem is that I want to find each compound and its corresponding number from dataframe 1 in dataframe 2, and return the entire matching row.
I was only able to come up with this (but due to the way the iteration goes, it doesn't do what I intend):
import pandas as pd

for c1, n1, c2, n2 in zip(df1.compound, df1.number, df2.compound, df2.number):
    if c1 == c2 and n1 == n2:
        # this is where I'd want to print the entire matching row of df2,
        # but zip only pairs rows by position, so this doesn't work
        print(c1, n1)
I wanted to print the entire row in which c1==c2 and n1==n2.
Example: for compound 17alpha_beta_SID_24898755 with number 8 in dataframe 1, return the row in which this compound and this number are found in dataframe 2. The result should be:
0.38,12.00,-19.00,17alpha_beta_SID_24898755,8
I'd like to do this for all the compounds and their corresponding numbers from dataframe 1. The example I gave is only a small piece of an extremely long list. If anyone could help, thank you!
Take a look at the df.merge method:
df1.merge(df2, on=['compound', 'number'], how='inner')
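For the sample frames above, usage would look like this (a sketch; df1 and df2 are assumed to already hold the first and second dataframes):
matches = df1.merge(df2, on=['compound', 'number'], how='inner')
print(matches)
Each row of df1 whose (compound, number) pair also appears in df2 comes back with the full set of df2 columns attached - for the first compound, that is the 0.38,12.00,-19.00,17alpha_beta_SID_24898755,8 row.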
I need to loop through certain rows in my CSV file, for example, row 231 to row 252. Then I want to add up the values I get from calculating every row and divide the sum by the number of rows I looped through. How would I do that?
I'm new to pandas, so I would really appreciate some help on this.
I have a CSV file from Yahoo finance looking something like this (it has many more rows):
Date,Open,High,Low,Close,Adj Close,Volume
2019-06-06,31.500000,31.990000,30.809999,31.760000,31.760000,1257700
2019-06-07,27.440001,30.000000,25.120001,29.820000,29.820000,5235700
2019-06-10,32.160000,35.099998,31.780001,32.020000,32.020000,1961500
2019-06-11,31.379999,32.820000,28.910000,29.309999,29.309999,907900
2019-06-12,29.270000,29.950001,28.900000,29.559999,29.559999,536800
I have done the basic steps of importing pandas and all that. Then I added two variables referring to different columns so I can reference just those columns easily:
import pandas as pd
df = pd.read_csv(file_name)
high = df.High
low = df.Low
Then I tried doing something like this. I tried using .loc in a variable, but that didn't seem to work. This is maybe super dumb, but I'm really new to pandas.
dates = df.loc[231:252, :]
for rows in dates:
    # calculations here
    # for example:
    print(high - low)
    # I would have a more complex calculation than this,
    # but for simplicity's sake let's stick with this.
The output of this is that for every row 1-252 it prints high - low, for example:
...
231 3.319997
232 3.910000
233 1.050001
234 1.850001
235 0.870001
...
But I only want this output on a certain number of rows.
Then I want to add up all of those values and divide them by the number of rows I looped. This part is simple so you don't need to include this in your answer but it's okay if you do.
Use skiprows and nrows. Keep the header, as per "Python Pandas read_csv skip rows but keep header", by passing a range to skiprows that starts at 1.
In [9]: pd.read_csv("t.csv",skiprows=range(1,3),nrows=2)
Out[9]:
Date Open High Low Close Adj Close Volume
0 2019-06-10 32.160000 35.099998 31.780001 32.020000 32.020000 1961500
1 2019-06-11 31.379999 32.820000 28.910000 29.309999 29.309999 907900
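Applied to the question's rows, that would be something like the following (a sketch; skiprows is 0-indexed over file lines, so range(1, 232) drops the first 231 data rows while keeping the header line, and nrows=22 then keeps rows 231 through 252):
df = pd.read_csv(file_name, skiprows=range(1, 232), nrows=22)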
.loc slices by label. For integer slicing, use .iloc:
dates = df.iloc[231:252]
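From there, the adding up and dividing the question asks about is just a mean over the slice (note that .iloc's end point is exclusive, so use 231:253 if row 252 itself should be included):
dates = df.iloc[231:252]
# mean() adds up the per-row differences and divides by the number of rows
avg = (dates['High'] - dates['Low']).mean()
print(avg)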
I am beginning to move from R to Python and have a stupid question.
I have been looking for close to 5 hours for a solution, with no luck.
I have the following code in R, which takes the dataframe df and aggregates the out-dates from a hospital based on unique IDs. My original table has many UIDs repeated, since someone may visit a hospital many times, and each time they leave the hospital they have an out-date. I want the UID and all of its out-dates in one row. I can do this very easily with the following code in R:
newdf= aggregate(data = df, OutDate~UID, FUN=paste, sep="," )
Can anyone pray tell me how this can be accomplished in Python?
Here's what my table looks like after using the above function in R:
-UID1, 10/20/2008, 11/30/2008, 1/1/1900, 1/1/1900
-UID2, 6/19/2010, 1/1/1900
-UID3, 11/17/2009
-UID4, 3/14/2010 , 4/20/2010, 1/1/1900, 1/1/1900
-UID5, 12/12/2008, 8/27/2009, 1/1/1900
Ignore the dates, I just made them up, but the output needs to look like the above. Previously I had multiple UID1 rows, one for each of the dates now spread across the columns.
How do I do this in Python?
You can do this with a defaultdict:
from collections import defaultdict

d = defaultdict(list)
for f in df.values:  # .values is an attribute, not a method
    # assuming the first value in each row is the UID:
    d[f[0]].append(f)
Now d is a dictionary where each key is a UID and each value is a list of rows from the dataframe. You can combine them into a string (like what you are doing with paste), like this:
for uid, values in d.items():  # iteritems() was Python 2; use items() in Python 3
    for value in values:
        print('{},{}'.format(uid, ','.join(str(v) for v in value)))
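If the data is already in a pandas DataFrame, groupby gives a close equivalent of the R aggregate call in one line (a sketch; it assumes the columns are named UID and OutDate and that OutDate holds strings - cast with .astype(str) first otherwise):
newdf = df.groupby('UID')['OutDate'].apply(','.join).reset_index()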
This sounds like building a dictionary where the key is the UID and you append each out-date to that key's entry as you loop through the data. This assumes the data comes from a csv file where each row is read by csv.DictReader; I make that assumption based on what you show of the data file and the separators. As a result, each entry in a row (which can include in time, out time, diagnosis, etc.) is keyed by the header row. I will also assume that you can work out how to read the data in with the csv module. The quick code below shows how to generate the dictionary entries from each row once you have it.
I show the final shape of the data first, followed by how it was derived.
data = {'UID1': ['out1', 'out2', 'out3'], 'UID2': ['out3', 'out4']}
data = {}
for d in datarow:  # datarow: the csv.DictReader over the file
    uid = d['UID']  # 'UID' and 'OUT' stand in for whatever the header row calls these columns
    if uid not in data:
        data[uid] = []  # a list, not a tuple: tuples have no append
    data[uid].append(d['OUT'])
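To print it in the dashed one-row-per-UID format shown in the question (a sketch over the data dict built above):
for uid, outs in data.items():
    print('-{}, {}'.format(uid, ', '.join(outs)))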
I have some experimental data which looks like this - http://paste2.org/YzJL4e1b (too long to post here). The blocks separated by field-name lines are different trials of the same experiment. I would like to read everything into a pandas dataframe, but have it bin certain trials together (for instance 0,1,6,7 in one group and 2,3,4,5 in another). This is because different trials have slightly different conditions, and I would like to analyze the difference in results between those conditions. I have a list of numbers for the different conditions in another file.
Currently I am doing this:
tracker_data = pd.DataFrame
tracker_data = tracker_data.from_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4)
tracker_data['GazePointXLeft'] = tracker_data['GazePointXLeft'].astype(np.float64)
but this of course just reads everything in one go (including the field-name lines) - it would be great if I could nest the blocks somehow, which would allow me to easily access them via numeric indices...
Do you have any ideas how I could best do this?
You should use read_csv rather than from_csv*:
tracker_data = pd.read_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4)
If you want to join a list of DataFrames like this you could use concat:
trackers = (pd.read_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4) for i in range(?))
df = pd.concat(trackers)
* which I think is deprecated.
I haven't quite got it working, but I think that's because of how I copy/pasted the data. Try this and let me know if it doesn't work.
Using some inspiration from this question
pat = "TimeStamp\tGazePointXLeft\tGazePointYLeft\tValidityLeft\tGazePointXRight\tGazePointYRight\tValidityRight\tGazePointX\tGazePointY\tEvent\n"
with open('rec.txt') as infile:
header, names, tail = infile.read().partition(pat)
names = names.split() # get rid of the tabs here
all_data = tail.split(pat)
res = [pd.read_csv(StringIO(x), sep='\t', names=names) for x in all_data]
We read in the whole file (so this won't work for huge files), then partition it on the known line giving the column names. tail is just a string with the rest of the data, so we can split that, again based on the names line. There may be a better way than using StringIO, but this should work.
I'm not sure how you want to join the separate blocks together, but this leaves them as a list. You can concat from there however you desire.
For larger files you might want to write a generator to read until you hit the column names and write a new file until you hit them again. Then read those in separately using something like Andy's answer.
That is separate from the question of how to work with the multiple blocks. Assuming you've got the list of DataFrames, which I've called res, you can use pandas' concat to join them together into a single DataFrame with a MultiIndex (also see the link Andy posted):
In [122]: df = pd.concat(res, axis=1, keys=['a', 'b', 'c']) # Use whatever makes sense for the keys
In [123]: df.xs('TimeStamp', level=1, axis=1)
Out[123]:
a b c
0 NaN NaN NaN
1 0.0 0.0 0.0
2 3.3 3.3 3.3
3 6.6 6.6 6.6
I ended up doing it iteratively. Very, very iteratively. Nothing else seemed to work.
pat = 'TimeStamp\tGazePointXLeft\tGazePointYLeft\tValidityLeft\tGazePointXRight\tGazePointYRight\tValidityRight\tGazePointX\tGazePointY\tEvent'

with open(bhpath + fileid + '_wmet.tsv') as infile:
    eye_data = infile.read().split(pat)

eye_data = [trial.split('\r\n') for trial in eye_data]  # split each trial into rows
for idx, trial in enumerate(eye_data):
    eye_data[idx] = [row.split('\t') for row in trial]  # split each row into fields
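If you then want proper DataFrames rather than nested lists, one more pass does it (a sketch; the length check drops the empty and partial fragments the splits leave behind):
import pandas as pd

names = pat.split('\t')  # recover the column names from the header pattern
trials = []
for trial in eye_data:
    rows = [row for row in trial if len(row) == len(names)]
    if rows:
        trials.append(pd.DataFrame(rows, columns=names))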