Group by in pandas dataframe and unioning a numpy array column - python

I have a CSV file where one of the columns looks like a numpy array. The first few lines look like the following
first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
When I load this CSV into a pandas DataFrame using the following code
data = pd.read_csv('myfile.csv', converters = {'first': np.float64, 'second': np.int64, 'third': np.array})
Now I want to group by the 'first' column and union the 'third' column, so that afterwards my DataFrame looks like
170.0, [19 23 234 376]
162.0, [1 2 3 4]
How do I achieve this? I tried several approaches, like the following, and none of them worked.
group_data = data.groupby('first')
group_data['third'].apply(lambda x: np.unique(np.concatenate(x)))

With your current csv file the 'third' column comes in as a string, instead of a list.
There might be nicer ways to convert to a list, but here goes...
from ast import literal_eval
data = pd.read_csv('test_groupby.csv')
# Convert to a string representation of a list...
data['third'] = data['third'].str.replace(' ', ',')
# Convert string to list...
data['third'] = data['third'].apply(literal_eval)
group_data = data.groupby('first')
# Two key points here:
# - use x.values instead of x, since x is a Series
# - wrap the result in list(...) so aggregate receives a single value per group
#   (np.array should work here too, but...?)
ans = group_data.aggregate(
    {'third': lambda x: list(np.unique(np.concatenate(x.values)))})
print(ans)
                    third
first
162          [1, 2, 3, 4]
170    [19, 23, 234, 376]
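Alternatively, the bracket-and-space format can be parsed directly at read time with a converter, which avoids the string-replacement and literal_eval steps. A sketch, assuming the file layout shown in the question (the in-memory StringIO stands in for the CSV file, and parse_third is a helper name of my own):

```python
from io import StringIO

import numpy as np
import pandas as pd

csv_text = """first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
"""

# Parse a string like "[19 234 376]" into an integer numpy array.
def parse_third(s):
    return np.array(s.strip('[]').split(), dtype=int)

data = pd.read_csv(StringIO(csv_text), converters={'third': parse_third})

# Concatenate each group's arrays and keep only the sorted unique values.
ans = data.groupby('first')['third'].apply(
    lambda x: np.unique(np.concatenate(x.values)).tolist())
print(ans)
```

The result is a Series keyed by 'first', with each value a sorted list of the unioned entries.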

Related

How to store the information returned from the value_counts() function

I have the following dataframe:
import pandas as pd
import numpy as np
df_Station_Weather = pd.DataFrame(
    {'ID': [102, 102, 558, 115, 115, 501, 501, 10, 10, 10, 10],
     'Code_Instrument': ['SEN_wider1898', 'SEN_UV',
                         'SEN_wider1898', 'SEN_wider1898',
                         'SEN_rain1015', 'SEN_01', 'SEN_01',
                         'SEN_AD', 'SEN_AD', 'SEN_AD',
                         'SEN_AD']})
print(df_Station_Weather)
ID Code_Instrument
102 SEN_wider1898
102 SEN_UV
558 SEN_wider1898
115 SEN_wider1898
115 SEN_rain1015
501 SEN_01
501 SEN_01
10 SEN_AD
10 SEN_AD
10 SEN_AD
10 SEN_AD
I would like to count the number of specific Instruments. So, I did the following:
list_Instrument = df_Station_Weather['Code_Instrument'].value_counts()
I would like to select only the top three counts. So, I did the following:
list_Instrument_2 = list_Instrument.head(3)
I need to create an array containing the Code_Instrument names with the three largest counts. It is this part I am unsure about.
I tried to build the array with the code:
array = np.array(list_Instrument_2)
However, this created array stores the count values, but I would like it to store the name of the Code_Instrument.
#Output:
print(array)
> array([4, 3, 2], dtype=int64)
# Desired output
array(['SEN_AD', 'SEN_wider1898', 'SEN_01'])
Thank you.
Try this,
list_Instrument.head(3).index.values
array(['SEN_AD', 'SEN_wider1898', 'SEN_01'], dtype=object)
If you want the names:
print(list_Instrument.head(3).index)
Returns:
Index(['SEN_AD', 'SEN_wider1898', 'SEN_01'], dtype='object')
Or print them nicely with a for loop:
for name in list_Instrument.head(3).index:
print(name)
To obtain:
SEN_AD
SEN_wider1898
SEN_01
Finally, to store them directly into a list:
res = list(list_Instrument.head(3).index)
print(res)
Will give:
['SEN_AD', 'SEN_wider1898', 'SEN_01']
I would suggest you use a hash table instead of an array to store your counts of occurrences. You could use the Code_Instrument names as the hash table keys and the number of occurrences as the values, and afterwards look up whatever you need by index, key, or value. Here is a link on how to use hash tables (in C#): https://www.tutorialsteacher.com/csharp/csharp-hashtable
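In Python, the built-in equivalent of a hash table is a dict, and collections.Counter builds one directly from the column. A sketch reproducing the three largest counts from the question's data (the variable names counts and top3 are my own):

```python
from collections import Counter

import pandas as pd

df_Station_Weather = pd.DataFrame(
    {'Code_Instrument': ['SEN_wider1898', 'SEN_UV', 'SEN_wider1898',
                         'SEN_wider1898', 'SEN_rain1015', 'SEN_01',
                         'SEN_01', 'SEN_AD', 'SEN_AD', 'SEN_AD', 'SEN_AD']})

# Counter is a dict mapping each instrument name to its count.
counts = Counter(df_Station_Weather['Code_Instrument'])

# most_common(3) returns the three (name, count) pairs with the highest counts.
top3 = [name for name, _ in counts.most_common(3)]
print(top3)  # ['SEN_AD', 'SEN_wider1898', 'SEN_01']
```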

How do I normalize a Python matrix with one CSV column?

I have a matrix where ONE COLUMN is a CSV, like this:-
matrix = [
    [1, "123,354,23"],
    [2, "234,34,678"]
]
How do I normalize this, so I get one row for each value in the CSV column, i.e. so that it looks like this:-
[
    [1, 123],
    [1, 354],
    [1, 23],
    [2, 234],
    [2, 34],
    [2, 678]
]
I'm open to using numpy or pandas.
Note, in my specific case there are many other non-CSV columns too.
Thanks
In the example you gave, this will do it:
matrix = [
    [1, "123,354,23"],
    [2, "234,34,678"]
]
import ast
expanded = [
    [index, item]
    for index, rowString in matrix
    for item in ast.literal_eval('[' + rowString + ']')
]
For your other "non-CSV" cases it depends on how they are formatted. Here, ast.literal_eval was a good tool for converting your apparent standard (comma-separated string) into a Python sequence that the variable item could iterate over. Other conversion approaches might be needed for other formats.
That produces a list of lists exactly as you specified. pandas is a good tool to use from there though. To then convert the list of lists into a pandas.DataFrame, you could say:
import pandas as pd
df = pd.DataFrame(expanded, columns=['index', 'item']).set_index(['index'])
print(df)
# prints:
#
# item
# index
# 1 123
# 1 354
# 1 23
# 2 234
# 2 34
# 2 678
Or, if by "many other non-CSV columns" you just mean an arbitrary number of additional entries in each row of matrix, but that the last one is still always CSV text, then it could look like this:
matrix = [
    [1, 3.1415927, 'Mary Poppins', "123,354,23"],
    [2, 2.7182818, 'Genghis Khan', "234,34,678"]
]
import ast
expanded = [
    row[:-1] + [item]
    for row in matrix
    for item in ast.literal_eval('[' + row[-1] + ']')
]
import pandas as pd
df = pd.DataFrame(expanded).set_index([0])
If the matrix contains pairs of the form (first, text), you can write:
result = [
    [first, int(rest)]
    for first, text in matrix
    for rest in text.split(",")
]
Or, without a list comprehension:
result = []
for first, text in matrix:
    for rest in text.split(","):
        result.append([first, int(rest)])
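If you end up in pandas anyway, versions 0.25 and later have DataFrame.explode, which does the whole normalisation in one step after splitting the string. A sketch using the sample data (the column names id and csv are my own):

```python
import pandas as pd

matrix = [
    [1, "123,354,23"],
    [2, "234,34,678"]
]

df = pd.DataFrame(matrix, columns=['id', 'csv'])

# Split each CSV string into a list, then give every element its own row.
df['csv'] = df['csv'].str.split(',')
df = df.explode('csv')
df['csv'] = df['csv'].astype(int)

result = df.values.tolist()
print(result)  # [[1, 123], [1, 354], [1, 23], [2, 234], [2, 34], [2, 678]]
```

Extra non-CSV columns are simply carried along unchanged by explode, which handles the "many other columns" case from the question.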

Selectively replacing DataFrames column names

I have a time series dataset in a .csv file that I want to process with Pandas (using Canopy). The column names from the file are a mix of strings and isotopic numbers.
cycles 40 38.02 35.98 P4
0 1 1.1e-8 4.4e-8 7.7e-8 8.8e-7
1 2 2.2e-8 5.5e-8 8.8e-8 8.7e-7
2 3 3.3e-8 6.6e-8 9.9e-8 8.6e-7
I would like this DataFrame to look like this
cycles 40 38 36 P4
0 1 1.1e-8 4.4e-8 7.7e-8 8.8e-7
1 2 2.2e-8 5.5e-8 8.8e-8 8.7e-7
2 3 3.3e-8 6.6e-8 9.9e-8 8.6e-7
The .csv files won't always have exactly the same column names; the numbers could differ slightly from file to file. To handle this, I've sampled the column names and rounded the values to the nearest integer. This is what my code looks like so far:
import pandas as pd
import numpy as np
df = {'cycles':[1,2,3],'40':[1.1e-8,2.2e-8,3.3e-8],'38.02':[4.4e-8,5.5e-8, 6.6e-8],'35.98':[7.7e-8,8.8e-8,9.9e-8,],'P4':[8.8e-7,8.7e-7,8.6e-7]}
df = pd.DataFrame(df, columns=['cycles', '40', '38.02', '35.98', 'P4'])
colHeaders = df.columns.values.tolist()
colHeaders[1:4] = list(map(float, colHeaders[1:4]))
colHeaders[1:4] = list(map(np.around, colHeaders[1:4]))
colHeaders[1:4] = list(map(int, colHeaders[1:4]))
colHeaders = list(map(str, colHeaders))
I tried df.rename(columns={df.loc[ 1 ]:colHeaders[ 0 ]}, ...), but I get this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I've read this post as well as the pandas 0.17 documentation, but I can't figure out how to use it to selectively replace the column names in a way that doesn't require me to assign new column names manually like this post.
I'm fairly new to Python and I've never posted on StackOverflow before, so any help would be greatly appreciated.
You could use a variant of your approach, but assign the new columns directly:
>>> cols = list(df.columns)
>>> cols[1:-1] = [int(round(float(x))) for x in cols[1:-1]]
>>> df.columns = cols
>>> df
cycles 40 38 36 P4
0 1 1.100000e-08 4.400000e-08 7.700000e-08 8.800000e-07
1 2 2.200000e-08 5.500000e-08 8.800000e-08 8.700000e-07
2 3 3.300000e-08 6.600000e-08 9.900000e-08 8.600000e-07
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
Or you could pass a function to rename:
>>> df = df.rename(columns=lambda x: x if x[0].isalpha() else int(round(float(x))))
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
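The same idea also works with an explicit mapping dict passed to rename, which sidesteps the x[0].isalpha() test (that test would fail on non-string labels). A sketch using the sample frame from the question (the mapping variable name is my own):

```python
import pandas as pd

df = pd.DataFrame({'cycles': [1, 2, 3],
                   '40': [1.1e-8, 2.2e-8, 3.3e-8],
                   '38.02': [4.4e-8, 5.5e-8, 6.6e-8],
                   '35.98': [7.7e-8, 8.8e-8, 9.9e-8],
                   'P4': [8.8e-7, 8.7e-7, 8.6e-7]},
                  columns=['cycles', '40', '38.02', '35.98', 'P4'])

# Build a rename mapping only for labels that parse as numbers.
mapping = {}
for col in df.columns:
    try:
        mapping[col] = int(round(float(col)))
    except ValueError:
        pass  # leave non-numeric labels ('cycles', 'P4') untouched

# Labels absent from the mapping are kept as-is by rename.
df = df.rename(columns=mapping)
print(list(df.columns))  # ['cycles', 40, 38, 36, 'P4']
```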

read file into a dictionary so that columns become keys and values, Python

I have a data file in the following format:
9, 12, 16, ABC, a12d
8, 09, 24, ADP, v154a
6, 07, 16, ADP, l28a
2, 14, 15, CDE, d123p
I need to build a dictionary of sets in the following format:
ABC : ([a12d])
ADP : ([v154a, l128a])
CDE : ([d123p])
I can build a set of any of the columns eg:
with open('data.csv', 'r') as r:
    name = set([line.strip().split(',')[3] for line in r])
I figure there must be a way to make every element of that set a dictionary key and add the adjacent column's value to its set. An added complication is that some keys have multiple values (for example, lines 2 and 3 above), but those are split across separate lines.
Thanks in advance for any help
from collections import defaultdict

d = defaultdict(set)
with open('data.csv', 'r') as r:
    for line in r:
        splitted = line.strip().split(',')
        name = splitted[3].strip()
        value = splitted[4].strip()
        d[name].add(value)
If you don't mind using pandas:
import pandas as pd
df = pd.read_csv("data.csv", header=None, usecols=[3,4], index_col=0, skipinitialspace=1, names=["key", "value"])
Which can be read as: read data.csv, which contains no header; use only columns 3 and 4, and use column 0 (formerly 3) as the index; skip the initial space in the values; and name the columns you read (3 and 4) key and value. This will give you:
df
value
key
ABC a12d
ADP v154a
ADP l28a
CDE d123p
So you can access any value with .loc:
df.loc["ABC"].values
array(['a12d'], dtype=object)
df.loc["ADP"].values
array([['v154a'],
['l28a']], dtype=object)
For the latter, you can flatten the array with ravel():
df.loc["ADP"].values.ravel()
array(['v154a', 'l28a'], dtype=object)
So it's not really a dictionary, but it behaves a bit like it, and you can do much more with this kind of object (a pandas Dataframe). Plus you can easily read and write csv files.
If you don't know pandas, have a look :
http://pandas.pydata.org/
http://pandas.pydata.org/pandas-docs/stable/
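If you do want an actual dictionary of sets out of the DataFrame, a groupby plus a set aggregation gets you there in one line. A sketch built from an in-memory frame rather than the file (column names key and value follow the read_csv call above):

```python
import pandas as pd

df = pd.DataFrame({'key': ['ABC', 'ADP', 'ADP', 'CDE'],
                   'value': ['a12d', 'v154a', 'l28a', 'd123p']})

# Collect each key's values into a set, then convert the Series to a dict.
d = df.groupby('key')['value'].apply(set).to_dict()
print(d)  # {'ABC': {'a12d'}, 'ADP': {'l28a', 'v154a'}, 'CDE': {'d123p'}}
```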
Here is code to read the column values and convert them to a dictionary in Python.
cat dictionary.txt (this file contains Name, Age, Birthyear):
Luffy 20 2000
Nami 18 2002
Chopper 10
##################### code is here #######
#!/usr/bin/python3.7.4
d = {}
with open("dictionary.txt") as f:
    for line in f:
        line = line.split()
        d.setdefault(line[0], []).append(line[1])
        if len(line) == 3:
            d.setdefault(line[0], []).append(line[2])
        else:
            d.setdefault(line[0], []).append('NULL')
print(d)
Output:
{'Luffy': ['20', '2000'], 'Nami': ['18', '2002'], 'Chopper': ['10', 'NULL']}

how to convert csv to dictionary using pandas

How can I convert a csv into a dictionary using pandas? For example I have 2 columns, and would like column1 to be the key and column2 to be the value. My data looks like this:
"name","position"
"UCLA","73"
"SUNY","36"
cols = ['name', 'position']
df = pd.read_csv(filename, names = cols)
Since the 1st line of your sample csv-data is a "header",
you may read the file as a pd.Series using the squeeze keyword of pandas.read_csv():
>>> pd.read_csv(filename, index_col=0, squeeze=True).to_dict()
{'UCLA': 73, 'SUNY': 36}
If you want to include the 1st line as data too, pass header=None.
Convert the columns to a list, then zip and convert to a dict:
In [37]:
df = pd.DataFrame({'col1':['first','second','third'], 'col2':np.random.rand(3)})
print(df)
dict(zip(list(df.col1), list(df.col2)))
col1 col2
0 first 0.278247
1 second 0.459753
2 third 0.151873
[3 rows x 2 columns]
Out[37]:
{'third': 0.15187291615699894,
'first': 0.27824681093923298,
'second': 0.4597530377539677}
ankostis' answer is, in my opinion, the most elegant solution when you have the file on disk.
However, if you do not want to or cannot go the detour of saving and loading from the file system, you can also do it like this:
df = pd.DataFrame({"name": ["UCLA", "SUNY"], "position": [73, 36]})
series = df["position"]
series.index = df["name"]
series.to_dict()
Result:
{'UCLA': 73, 'SUNY': 36}
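The detour through a separate Series can also be collapsed into one line with set_index. A sketch using the same two columns:

```python
import pandas as pd

df = pd.DataFrame({"name": ["UCLA", "SUNY"], "position": [73, 36]})

# Make 'name' the index, select the 'position' column, and convert.
d = df.set_index("name")["position"].to_dict()
print(d)  # {'UCLA': 73, 'SUNY': 36}
```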
