Using map to convert pandas dataframe to list - python

I am using map to convert some columns in a dataframe to list of dicts. Here is a MWE illustrating my question.
import pandas as pd
df = pd.DataFrame()
df['Col1'] = [197, 1600, 1200]
df['Col2'] = [297, 2600, 2200]
df['Col1_a'] = [198, 1599, 1199]
df['Col2_a'] = [296, 2599, 2199]
print(df)
The output is
   Col1  Col2  Col1_a  Col2_a
0   197   297     198     296
1  1600  2600    1599    2599
2  1200  2200    1199    2199
Now say I want to extract only those columns whose name ends with the suffix "_a". One way to do it is the following:
list_col = ["Col1","Col2"]
cols_w_suffix = map(lambda x: x + '_a', list_col)
print(df[cols_w_suffix].to_dict('records'))
[{'Col1_a': 198, 'Col2_a': 296}, {'Col1_a': 1599, 'Col2_a': 2599}, {'Col1_a': 1199, 'Col2_a': 2199}]
This is the expected answer. However, if I try to print the same expression again, I get an empty result.
print(df[cols_w_suffix].to_dict('records'))
[]
Why does it evaluate to an empty dataframe the second time? I think I am missing something about the behavior of map, because when I pass the column names directly, the output is still as expected.
df[["Col1_a","Col2_a"]].to_dict('records')
[{'Col1_a': 198, 'Col2_a': 296}, {'Col1_a': 1599, 'Col2_a': 2599}, {'Col1_a': 1199, 'Col2_a': 2199}]

Your map iterator is exhausted. In Python 3, map returns a one-shot iterator, so the first df[cols_w_suffix] consumes it and the second lookup sees an empty sequence of column names.
Use cols_w_suffix = list(map(lambda x: x + '_a', list_col)) or a list comprehension cols_w_suffix = [f'{x}_a' for x in list_col].
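A quick standalone illustration of the one-shot behavior:
m = map(lambda x: x + '_a', ["Col1", "Col2"])
print(list(m))  # ['Col1_a', 'Col2_a'] -- the first pass consumes the iterator
print(list(m))  # [] -- a second pass finds nothing left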
That said, a better method to select the columns would be:
df.filter(regex='_a$')
Or:
df.loc[:, df.columns.str.endswith('_a')]
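Either selection can be evaluated repeatedly, since no one-shot iterator is involved:
print(df.filter(regex='_a$').to_dict('records'))
print(df.filter(regex='_a$').to_dict('records'))  # same records on every call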

Pandas Filtering Column

Following up on the accepted answer at another question on SO (Filtering a dataframe by column name based on multiple conditions):
import pandas as pd
c = ["XYYZ 1011", "XYYZ 1021", "XYYZ 1031", "XXYZ 1011", "XXYZ 1021", "XXYZ 1031","XYY 1001", "XYY 1002", "XXZ 1001"]
df = pd.DataFrame(columns=c)
print(df)
df = df.filter(regex='X[XY|YY]Z 10[1|2|3]1')
print(df)
The output of the second print misses the XYYZ 1011, XYYZ 1021, and XYYZ 1031 columns. Why?
IIUC, your regex should be 'X(XY|YY)Z 10[123]1':
df.filter(regex='X(XY|YY)Z 10[123]1')
output:
Empty DataFrame
Columns: [XYYZ 1011, XYYZ 1021, XYYZ 1031, XXYZ 1011, XXYZ 1021, XXYZ 1031]
Index: []
You have to differentiate character classes, [123] -> one character that is "1", "2", or "3", from alternations, (ABC|DEF) -> the string ABC or the string DEF.
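A minimal sketch with Python's re module to see the difference (df.filter applies re.search to each column label, so substring matches count):
import re

# [XY|YY] is a character class: ONE character that is X, Y, or |.
# It matches the substring 'XYZ 1011' inside 'XXYZ 1011', but nothing in 'XYYZ 1011'.
print(re.search(r'X[XY|YY]Z 10[1|2|3]1', 'XXYZ 1011'))  # matches 'XYZ 1011'
print(re.search(r'X[XY|YY]Z 10[1|2|3]1', 'XYYZ 1011'))  # None
# (XY|YY) is an alternation: the string XY or the string YY.
print(re.search(r'X(XY|YY)Z 10[123]1', 'XYYZ 1011'))    # matches 'XYYZ 1011'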

Creating new columns in a csv file using data from a different csv file

I have this Data Science problem where I need to create a test set using info provided in two csv files.
Problem
data1.csv
cat,In1,In2
aaa, 0, 1
aaa, 2, 1
aaa, 2, 0
aab, 3, 2
aab, 1, 2
data2.csv
cat,index,attribute1,attribute2
aaa, 0, 150, 450
aaa, 1, 250, 670
aaa, 2, 30, 250
aab, 0, 60, 650
aab, 1, 50, 30
aab, 2, 20, 680
aab, 3, 380, 250
From these two files, what I need is an updated data1.csv file where, in place of In1 and In2, I have the attributes of those specific indices (In1 and In2) under the specific category (cat).
Note: All the indices in a specific category (cat) have their own attributes.
Result should look like this,
updated_data1.csv
cat,In1a1,In1a2,In2a1,In2a2
aaa, 150, 450, 250, 670
aaa, 30, 250, 250, 670
aaa, 30, 250, 150, 450
aab, 380, 250, 20, 680
aab, 50, 30, 20, 680
I need an approach to tackle this problem using pandas in Python. So far I have loaded the csv files into my Jupyter notebook, and I have no clue where to start.
Please note this is my first week using Python for data manipulation and I have very little knowledge of the language. Also pardon me for the ugly formatting; I'm using a mobile phone to type this question.
As others have suggested, you can use pd.merge. In this case, you need to merge on multiple columns. Basically you need to define which columns from the left DataFrame (here data1) map to which columns from the right DataFrame (here data2). Also see pandas merging 101.
import pandas as pd

# Read the csvs
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
# DataFrame with the in1 columns
df1 = pd.merge(left=data1, right=data2, left_on = ['cat','In1'], right_on = ['cat', 'index'])
df1 = df1[['cat','attribute1','attribute2']].set_index('cat')
# DataFrame with the in2 columns
df2 = pd.merge(left=data1, right=data2, left_on = ['cat','In2'], right_on = ['cat', 'index'])
df2 = df2[['cat','attribute1','attribute2']].set_index('cat')
# Join the two dataframes together.
df = pd.concat([df1, df2], axis=1)
# Name the columns as desired
df.columns = ['in1a1', 'in1a2', 'in2a1', 'in2a2']
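If the goal is the updated csv from the question, you can then write the result back out (a minimal sketch; the filename is the one from the question):
df.reset_index().to_csv('updated_data1.csv', index=False)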
One should generally try to avoid iterating through DataFrames, because it's not very efficient. But it's definitely a possible solution here.
# Read the csvs
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
# This list will be the data for the resulting DataFrame
rows = []
# Iterate through data1, unpacking values in each row to variables
for idx, cat, in1, in2 in data1.itertuples():
    # Create a dictionary for each row where the keys are the column headers of the future DataFrame
    row = {}
    row['cat'] = cat
    # Pick the correct row from data2
    in1 = (data2['index'] == in1) & (data2['cat'] == cat)
    in2 = (data2['index'] == in2) & (data2['cat'] == cat)
    # Assign the correct values to the keys in the dictionary
    row['in1a1'] = data2.loc[in1, 'attribute1'].values[0]
    row['in1a2'] = data2.loc[in1, 'attribute2'].values[0]
    row['in2a1'] = data2.loc[in2, 'attribute1'].values[0]
    row['in2a2'] = data2.loc[in2, 'attribute2'].values[0]
    # Append the dictionary to the list
    rows.append(row)
# Construct a DataFrame from the list of dictionaries
df = pd.DataFrame(rows)
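A middle ground between the two approaches (my own sketch, not part of the original answer): index data2 by (cat, index) once, so each lookup inside the loop is a single .loc call instead of a full Boolean scan.
# Build a (cat, index) -> attributes lookup table once
lookup = data2.set_index(['cat', 'index'])

rows = []
for idx, cat, in1, in2 in data1.itertuples():
    a1 = lookup.loc[(cat, in1)]  # attributes for In1
    a2 = lookup.loc[(cat, in2)]  # attributes for In2
    rows.append({'cat': cat,
                 'in1a1': a1['attribute1'], 'in1a2': a1['attribute2'],
                 'in2a1': a2['attribute1'], 'in2a2': a2['attribute2']})
df = pd.DataFrame(rows)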

Retrieve dataframe row based on list from a cell value

I am trying to retrieve a row from a pandas dataframe where the cell value is a list. I have tried isin, but it looks like it performs an OR operation, not an AND operation.
>>> import pandas as pd
>>> df = pd.DataFrame([['100', 'RB','stacked'], [['101','102'], 'CC','tagged'], ['102', 'S+C','tagged']],
columns=['vlan_id', 'mode' , 'tag_mode'],index=['dinesh','vj','mani'])
>>> df
           vlan_id mode tag_mode
dinesh         100   RB  stacked
vj      [101, 102]   CC   tagged
mani           102  S+C   tagged
>>> df.loc[df['vlan_id'] == '102']; # Fetching string value match
     vlan_id mode tag_mode
mani     102  S+C   tagged
>>> df.loc[df['vlan_id'].isin(['100','102'])]; # Fetching if contains either 100 or 102
       vlan_id mode tag_mode
dinesh     100   RB  stacked
mani       102  S+C   tagged
>>> df.loc[df['vlan_id'] == ['101','102']]; # Fails ?
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\ops.py", line 1283, in wrapper
res = na_op(values, other)
File "C:\Python27\lib\site-packages\pandas\core\ops.py", line 1143, in na_op
result = _comp_method_OBJECT_ARRAY(op, x, y)
File "C:\Python27\lib\site-packages\pandas\core\ops.py", line 1120, in _comp_method_OBJECT_ARRAY
result = libops.vec_compare(x, y, op)
File "pandas\_libs\ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 3 vs 2
I can get the values out into a list and compare them. Instead, is there any way to check against a list value using the .loc method itself?
To find a list you can iterate over the values of vlan_id and compare each value using np.array_equal:
import numpy as np

df.loc[[np.array_equal(x, ['101','102']) for x in df.vlan_id.values]]
       vlan_id mode tag_mode
vj  [101, 102]   CC   tagged
That said, it's advisable to avoid using lists as cell values in a dataframe.
DataFrame.loc can use a list of labels or a Boolean array to access rows and columns. The list comprehension above constructs a Boolean array.
I am not sure if this is the best way to do this, or if there is a good way to do this, since as far as I know pandas doesn't really support storing lists in Series. Still:
l = ['101', '102']
df.loc[pd.concat([df['vlan_id'].str[i] == l[i] for i in range(len(l))], axis=1).all(axis=1)]
Output:
       vlan_id mode tag_mode
vj  [101, 102]   CC   tagged
Another workaround would be to transform your vlan_id column so that it can be queried as a string. You can do that by joining the vlan_id list values into comma-separated strings.
df['proxy'] = df['vlan_id'].apply(lambda x: ','.join(x) if type(x) is list else ','.join([x]) )
l = ','.join(['101', '102'])
print(df.loc[df['proxy'] == l])
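A variant along the same lines (my sketch, not from the answers above): normalize each cell to a tuple, which is hashable, and match it with isin:
# Tuples compare as whole values, so list and scalar cells can be matched uniformly
as_tuples = df['vlan_id'].apply(lambda x: tuple(x) if isinstance(x, list) else (x,))
print(df.loc[as_tuples.isin([('101', '102')])])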

How can I read 2 columns of a dataframe in pandas and return a list of lists of the data?

I just started learning pandas and I have a dataframe that looks like
Date        Average  Volume
2013-02-07      400    4100
2013-02-08      389    3400
2013-02-23      380    3100
If the user says they want the information from the 1st column (I'm referring to Average here; I am excluding Date since it's sort of a constant), I want it to return
['2013-02-07', 400]
['2013-02-08', 389]
['2013-02-23', 380]
If they asked for the info from the 2nd column it would return the date and volume info in the same format.
# intended signature: data_list(file_object, column_number)
inp = int(input('Which column? '))
if inp == 1:
    print(df['Average'].iloc[0:])
if inp == 2:
    print(df['Volume'].iloc[0:])
This returns the column the user wants, but how can I return it with the date in the format requested above?
You can use values.tolist():
>>> df[['Date','Average']].values.tolist()
[['2013-02-07', 400], ['2013-02-08', 389], ['2013-02-23', 380]]
If you want a lazy iterator instead, you can use map (keeping in mind it can only be consumed once):
>>> map(list, df[['Date','Average']].values)
<map object at 0x7f3fd47023c8>
>>>
>>> [*map(list, df[['Date','Average']].values)]
[['2013-02-07', 400], ['2013-02-08', 389], ['2013-02-23', 380]]
You can precalculate your lists of lists and use a dictionary to store the results. In addition, you can use pd.Series.dt.strftime to format your date as required.
Here's a demo:
df['Date'] = pd.to_datetime(df['Date'])
df_list = {col: df.assign(Date=df.Date.dt.strftime('%Y-%m-%d'))
                 .loc[:, ['Date', col]].values.tolist()
           for col in ('Average', 'Volume')}
select = input('Enter a column name:\n')
print(df_list[select])
Example result:
Enter a column name:
Volume
[['2013-02-07', 4100], ['2013-02-08', 3400], ['2013-02-23', 3100]]

How to convert csv to dictionary using pandas

How can I convert a csv into a dictionary using pandas? For example I have 2 columns, and would like column1 to be the key and column2 to be the value. My data looks like this:
"name","position"
"UCLA","73"
"SUNY","36"
cols = ['name', 'position']
df = pd.read_csv(filename, names = cols)
Since the 1st line of your sample csv-data is a "header",
you may read it as a pd.Series using the squeeze keyword of pandas.read_csv():
>>> pd.read_csv(filename, index_col=0, squeeze=True).to_dict()
{'UCLA': 73, 'SUNY': 36}
If you want to include the 1st line as data too, pass header=None.
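Note that the squeeze keyword was deprecated in pandas 1.4 and removed in 2.0; on recent versions the equivalent call uses the DataFrame.squeeze method:
>>> pd.read_csv(filename, index_col=0).squeeze("columns").to_dict()
{'UCLA': 73, 'SUNY': 36}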
Convert the columns to a list, then zip and convert to a dict:
In [37]:
df = pd.DataFrame({'col1':['first','second','third'], 'col2':np.random.rand(3)})
print(df)
dict(zip(list(df.col1), list(df.col2)))
     col1      col2
0   first  0.278247
1  second  0.459753
2   third  0.151873

[3 rows x 2 columns]
Out[37]:
{'third': 0.15187291615699894,
'first': 0.27824681093923298,
'second': 0.4597530377539677}
ankostis' answer is, in my opinion, the most elegant solution when you have the file on disk.
However, if you do not want to, or cannot, go through the detour of saving and loading via the file system, you can also do it like this:
df = pd.DataFrame({"name": ["UCLA", "SUNY"], "position": [73, 36]})
series = df["position"]
series.index = df["name"]
series.to_dict()
Result:
{'UCLA': 73, 'SUNY': 36}
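For what it's worth, the same mapping can be built in one chained expression (a small variant of the above):
df = pd.DataFrame({"name": ["UCLA", "SUNY"], "position": [73, 36]})
print(df.set_index("name")["position"].to_dict())  # {'UCLA': 73, 'SUNY': 36}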
