I have a pandas Series object
<class 'pandas.core.series.Series'>
that looks like this:
userId
1 3072 1196 838 2278 1259
2 648 475 1 151 1035
3 457 150 300 21 339
4 1035 7153 953 4993 2571
5 260 671 1210 2628 7153
6 4993 1210 2291 589 1196
7 150 457 111 246 25
8 1221 8132 30749 44191 1721
9 296 377 2858 3578 3256
10 2762 377 2858 1617 858
11 527 593 2396 318 1258
12 3578 2683 2762 2571 2580
13 7153 150 5952 35836 2028
14 1197 2580 2712 2762 1968
15 1245 1090 1080 2529 1261
16 296 2324 4993 7153 1203
17 1208 1234 6796 55820 1060
18 1377 1 1073 1356 592
19 778 1173 272 3022 909
20 329 534 377 73 272
21 608 904 903 1204 111
22 1221 1136 1258 4973 48516
23 1214 1200 1148 2761 2791
24 593 318 162 480 733
25 314 969 25 85 766
26 293 253 4878 46578 64614
27 1193 2716 24 2959 2841
28 318 260 58559 8961 4226
29 318 260 1196 2959 50
30 1077 1136 1230 1203 3481
...
642 123 593 750 1212 50
643 750 671 1663 2427 5618
644 780 3114 1584 11 62
645 912 2858 1617 1035 903
646 608 527 21 2710 1704
647 1196 720 5060 2599 594
648 46578 50 745 1223 5995
649 318 300 110 529 246
650 733 110 151 318 364
651 1240 1210 541 589 1247
652 4993 296 95510 122900 736
653 858 1225 1961 25 36
654 333 1221 3039 1610 4011
655 318 47 6377 527 2028
656 527 1193 1073 1265 73
657 527 349 454 357 97
658 457 590 480 589 329
659 474 508 1 288 477
660 904 1197 1247 858 1221
661 780 1527 3 1376 5481
662 110 590 50 593 733
663 2028 919 527 2791 110
664 1201 64839 1228 122886 1203
665 1197 858 7153 1221 6539
666 318 300 161 500 337
667 527 260 318 593 223
668 161 527 151 110 300
669 50 2858 4993 318 2628
670 296 5952 508 272 1196
671 1210 1200 7153 593 110
What is the best way to go about outputting this to a txt file (e.g. output.txt) such that the format looks like this?
User-id1 movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
User-id2 movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
The values on the far left are the userIds and the other values are the movieIds.
Here is the code that generated the above:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def predict(l):
    # finds the userIds corresponding to the top 5 similarities
    # calculate the prediction according to the formula
    return (df[l.index] * l).sum(axis=1) / l.sum()

# use userId as columns for convenience when interpreting the formula
df = pd.read_csv('ratings.csv').pivot(columns='userId',
                                      index='movieId',
                                      values='rating')
df = df - df.mean()
similarity = pd.DataFrame(cosine_similarity(
    df.T.fillna(0)), index=df.columns, columns=df.columns)
res = df.apply(lambda col: ' '.join('{}'.format(mid) for mid in (0 * col).fillna(
    predict(similarity[col.name].nlargest(6).iloc[1:])).nlargest(5).index))

# Do not understand why this does not work for me but works below
df = pd.DataFrame.from_items(zip(res.index, res.str.split(' ')))
#print(df)
df.columns = ['movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']
df['customer_id'] = df.index
df = df[['customer_id', 'movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']]
df.to_csv('filepath.txt', sep=' ', index=False)
I tried implementing @emmet02's solution but got this error, and I do not understand why:
ValueError: Length mismatch: Expected axis has 671 elements, new values have 5 elements
Any advice is appreciated, please let me know if you need any more information or clarification.
I would suggest turning your pd.Series into a pd.DataFrame first.
df = pd.DataFrame.from_items(zip(series.index, series.str.split(' '))).T
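(Note: pd.DataFrame.from_items is deprecated and has been removed in recent pandas versions. A minimal sketch of an equivalent construction on modern pandas, assuming the same space-separated Series, called series here:
import pandas as pd

# Hypothetical Series mirroring the question's data: indexed by userId,
# five space-separated movie ids per entry.
series = pd.Series({1: '3072 1196 838 2278 1259',
                    2: '648 475 1 151 1035'})

# str.split with expand=True returns a DataFrame directly,
# one column per movie id, so no transpose is needed.
df = series.str.split(' ', expand=True)
)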
So long as the Series has the same number of values (for every entry!), separated by a space, this will return a dataframe in this format
Out[49]:
0 1 2 3 4
0 3072 648 457 1035 260
1 1196 475 150 7153 671
2 838 1 300 953 1210
3 2278 151 21 4993 2628
4 1259 1035 339 2571 7153
Next I would name the columns appropriately
df.columns = ['movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']
Finally, the dataframe is indexed by customer id (I am supposing this based upon your series index). We want to move that into the dataframe, and then reorganise the columns.
df['customer_id'] = df.index
df = df[['customer_id', 'movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']]
This now leaves you with a dataframe like this
customer_id movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
0 0 3072 648 457 1035 260
1 1 1196 475 150 7153 671
2 2 838 1 300 953 1210
3 3 2278 151 21 4993 2628
4 4 1259 1035 339 2571 7153
which I would recommend you write to disk as a csv using
df.to_csv('filepath.csv', index=False)
If, however, you want to write it as a text file with only spaces separating the values, you can use the same function but pass the separator.
df.to_csv('filepath.txt', sep=' ', index=False)
I don't think that the Series object is the right data structure for the problem you want to solve. Treating numerical data as numerical data (and in a DataFrame) is far easier than maintaining 'space delimited string' conversions, in my opinion.
You can use the following approach, splitting the items of your Series object (that I called s) into lists and converting those a list of those lists into a DataFrame object (that I called df):
df = pd.DataFrame([[idx] + row for idx, row in s.str.split(' ').items()])
The [idx] + row part prepends the index (the userId) to each list of movie ids, and this is done for every row of the Series. (Iterating with items() avoids mixing positional and label indexing, which matters when the index does not start at 0.)
After that, you could just dump the DataFrame to a .txt file using a space as separator:
df.to_csv('output.txt', sep=' ', index=False)
You could also name your columns before dumping it, as suggested earlier.
It's also worth avoiding the csv-writing machinery here; when the series is text, it tends to require hackery to avoid escaping/quoting hell. A la:
with open(filename, 'w') as f:
    for entry in df['target_column']:
        f.write(entry + '\n')
Of course you can add the series index yourself in the loop, if desired.
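For instance, a minimal sketch that prepends the index on each line (assuming the question's Series, called series here):
with open('output.txt', 'w') as f:
    # items() yields (index, value) pairs, i.e. (userId, 'id1 id2 id3 id4 id5')
    for user_id, entry in series.items():
        f.write('{} {}\n'.format(user_id, entry))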
I suggest modifying the code as shown below:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def predict(l):
    # finds the userIds corresponding to the top 5 similarities
    # calculate the prediction according to the formula
    return (df[l.index] * l).sum(axis=1) / l.sum()

# use userId as columns for convenience when interpreting the formula
df = pd.read_csv('ratings.csv').pivot(columns='userId',
                                      index='movieId',
                                      values='rating')
df = df - df.mean()
similarity = pd.DataFrame(cosine_similarity(
    df.T.fillna(0)), index=df.columns, columns=df.columns)
res = df.apply(lambda col: (0 * col).fillna(
        predict(similarity[col.name].nlargest(6).iloc[1:])
    ).nlargest(5).index.tolist()
).apply(pd.Series).rename(
    columns=lambda col_name: 'movie-id{}'.format(col_name + 1)
).reset_index().rename(columns={'userId': 'customer_id'})

# convert to csv
res.to_csv('filepath.txt', sep=' ', index=False)
In [2]: res.head()
Out[2]:
customer_id movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
0 1 3072 1196 838 2278 1259
1 2 648 475 1 151 1035
2 3 457 150 300 21 339
3 4 1035 7153 953 4993 2571
4 5 260 671 1210 2628 7153
Show the file:
In [3]: ! head -5 filepath.txt
customer_id movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
1 3072 1196 838 2278 1259
2 648 475 1 151 1035
3 457 150 300 21 339
4 1035 7153 953 4993 2571
Old question, but adding an answer so that others can get help.
From the question's title it seems the user wanted to dump console output to a file. Use the .to_string() method to dump a DataFrame (or Series) into a text file in the same format as it appears on the console. For example, I copied the OP's example and prepared a DataFrame using pd.read_clipboard():
>>> df = pd.read_clipboard(index_col=0, names=['movie-id1',
...                                            'movie-id2',
...                                            'movie-id3',
...                                            'movie-id4',
...                                            'movie-id5'])
>>> df.index.name = 'userId'
>>> with open("/home/grijesh/Downloads/example.txt", 'w') as of:
...     df.to_string(buf=of)
One can also learn more about the formatting code in pandas' io/formats/format.py.
PS: I used it for a fairly big data set and it worked fine; I used it for text pattern observation.
Related
I have a column in a data frame (df1) that contains the number of columns I should subset in another dataframe (df2) before performing a calculation.
df1
   count
0      5
1      6
2      8
3      1
4      9
df2
     A    B    C    D    E    F    G    H    I
0  337  687  972  530  366  187  964  952  820
1  144  971  233  819  340  600  694  155  913
2  904  951  732  987  661  907  786  126  674
3  675  474  925  663  570  591  805  404  184
4  775  907  616  973  800  117  512  222  300
However, the number of columns used for the subset has a threshold/limit so I tried to write it like this:
df2['mean_6cols'] = np.where(df1['count'] >= 6,
                             df2.iloc[:, :6].mean(axis=1),
                             df2.iloc[:, :df1['count']].mean(axis=1))
So if df1['count'] is at least 6, I want to use the first 6 columns from df2, but if df1['count'] is less than 6, I want to use the value specified in the row.
Unfortunately, it results in the error below, presumably because of df1['count'] inside iloc.
TypeError: cannot do positional indexing with these indexers
I did think of writing a for-loop and using the index variable to get the current value of df1['count'] for each row, but it's not a practical solution since I have a lot of different combinations of calculations/dataframes to do this for.
You can use numpy broadcasting to mask df2 by df1['count']:
mask = df1[['count']].to_numpy() > np.arange(df2.shape[1])
df2['mean_6cols'] = np.where(df1['count'] >= 6,
                             df2.iloc[:, :6].mean(axis=1),
                             df2.where(mask).mean(axis=1))
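To see what the broadcast comparison produces, here is a minimal sketch using the sample df1 above: each row of the boolean mask is True for the first count column positions.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'count': [5, 6, 8, 1, 9]})

# A (5, 1) column of counts compared against a (9,) row of column
# positions broadcasts to a (5, 9) boolean mask.
mask = df1[['count']].to_numpy() > np.arange(9)
print(mask[0])
# [ True  True  True  True  True False False False False]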
Output:
     A    B    C    D    E    F    G    H    I  mean_6cols
0  337  687  972  530  366  187  964  952  820  578.400000
1  144  971  233  819  340  600  694  155  913  517.833333
2  904  951  732  987  661  907  786  126  674  857.000000
3  675  474  925  663  570  591  805  404  184  675.000000
4  775  907  616  973  800  117  512  222  300  698.000000
The following code generates a pandas.DataFrame from a 3D array over the first axis. I manually create the column names (defining cols): is there a more built-in way to do this (to avoid potential errors, e.g. regarding C order)?
I am looking for a way to guarantee that the order of the indices is respected after the reshape operation (here it relies on iterating over range(nrow) and range(ncol) in the correct order).
import numpy as np
import pandas as pd
nt = 6 ; nrow = 4 ; ncol = 3 ; shp = (nt, nrow, ncol)
np.random.seed(0)
a = np.array(np.random.randint(0, 1000, nt*nrow*ncol)).reshape(shp)
# This is the line I think should be improved --> any numpy function or so?
cols = [str(i) + '-' + str(j) for i in range(nrow) for j in range(ncol)]
adf = pd.DataFrame(a.reshape(nt, -1), columns = cols)
print(adf)
0-0 0-1 0-2 1-0 1-1 1-2 2-0 2-1 2-2 3-0 3-1 3-2
0 684 559 629 192 835 763 707 359 9 723 277 754
1 804 599 70 472 600 396 314 705 486 551 87 174
2 600 849 677 537 845 72 777 916 115 976 755 709
3 847 431 448 850 99 984 177 755 797 659 147 910
4 423 288 961 265 697 639 544 543 714 244 151 675
5 510 459 882 183 28 802 128 128 932 53 901 550
EDIT
Illustrating why I don't like my solution: it is just too easy to write code which technically works but produces a wrong result (inverting i and j, or nrow and ncol):
wrongcols1 = [str(i) + '-' + str(j) for i in range(ncol) for j in range(nrow)]
adf2 = pd.DataFrame(a.reshape(nt, -1), columns=wrongcols1)
print(adf2)
0-0 0-1 0-2 0-3 1-0 1-1 1-2 1-3 2-0 2-1 2-2 2-3
0 684 559 629 192 835 763 707 359 9 723 277 754
1 804 599 70 472 600 396 314 705 486 551 87 174
2 600 849 677 537 845 72 777 916 115 976 755 709
3 847 431 448 850 99 984 177 755 797 659 147 910
4 423 288 961 265 697 639 544 543 714 244 151 675
5 510 459 882 183 28 802 128 128 932 53 901 550
wrongcols2 = [str(j) + '-' + str(i) for i in range(nrow) for j in range(ncol)]
adf3 = pd.DataFrame(a.reshape(nt, -1), columns=wrongcols2)
print(adf3)
0-0 1-0 2-0 0-1 1-1 2-1 0-2 1-2 2-2 0-3 1-3 2-3
0 684 559 629 192 835 763 707 359 9 723 277 754
1 804 599 70 472 600 396 314 705 486 551 87 174
2 600 849 677 537 845 72 777 916 115 976 755 709
3 847 431 448 850 99 984 177 755 797 659 147 910
4 423 288 961 265 697 639 544 543 714 244 151 675
5 510 459 882 183 28 802 128 128 932 53 901 550
Try this and see if it fits your use case.
Generate the columns via a combination of np.indices, np.dstack and np.vstack:
columns = np.vstack(np.dstack(np.indices((nrow, ncol))))
array([[0, 0],
[0, 1],
[0, 2],
[1, 0],
[1, 1],
[1, 2],
[2, 0],
[2, 1],
[2, 2],
[3, 0],
[3, 1],
[3, 2]])
Now convert to string via a combination of map, join and list comprehension:
columns = ["-".join(map(str, entry)) for entry in columns]
['0-0',
'0-1',
'0-2',
'1-0',
'1-1',
'1-2',
'2-0',
'2-1',
'2-2',
'3-0',
'3-1',
'3-2']
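As a side note, np.ndindex walks a shape in C (row-major) order, which may be a shorter way to get the same pairs (a sketch, assuming the same nrow and ncol as in the question):
import numpy as np

nrow, ncol = 4, 3
# np.ndindex yields (i, j) in C order, matching a.reshape(nt, -1)
cols = ['{}-{}'.format(i, j) for i, j in np.ndindex(nrow, ncol)]
# ['0-0', '0-1', '0-2', '1-0', ...]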
Let us know how it goes.
You could try to use pd.MultiIndex to construct your hierarchy.
First redefine your cols to a list of tuples:
cols = [(i, j) for i in range(nrow) for j in range(ncol)]
Then construct the multi index with cols:
multi_cols = pd.MultiIndex.from_tuples(cols)
And build the dataframe:
adf = pd.DataFrame(a.reshape(nt, -1), columns=multi_cols)
Result:
0 1 2 3
0 1 2 0 1 2 0 1 2 0 1 2
0 684 559 629 192 835 763 707 359 9 723 277 754
1 804 599 70 472 600 396 314 705 486 551 87 174
2 600 849 677 537 845 72 777 916 115 976 755 709
3 847 431 448 850 99 984 177 755 797 659 147 910
4 423 288 961 265 697 639 544 543 714 244 151 675
5 510 459 882 183 28 802 128 128 932 53 901 550
Access of elements:
print(adf[1][2][0])
>>> 763
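If you prefer to skip the list comprehension entirely, pd.MultiIndex.from_product builds the same hierarchy directly; a sketch under the same nt, nrow, ncol as the question:
import numpy as np
import pandas as pd

nt, nrow, ncol = 6, 4, 3
a = np.arange(nt * nrow * ncol).reshape(nt, nrow, ncol)

# from_product iterates the last level fastest (C order), so the
# columns line up with a.reshape(nt, -1).
multi_cols = pd.MultiIndex.from_product([range(nrow), range(ncol)])
adf = pd.DataFrame(a.reshape(nt, -1), columns=multi_cols)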
I am trying to stack several cyclical data elements of a DataFrame on top of each other to change the DataFrame dimensions, e.g. go from 100x20 to 500x4.
Sample 11x7 input:
     0    1     2    3     4    5     6
0    1  713  1622  658  1658  620  1734
1    2  714  1623  657  1700  618  1735
2    3  714  1624  656  1701  617  1736
3    4  714  1625  655  1702  615  1738
4    5  714  1626  654  1703  614  1739
5    6  713  1627  653  1705  612  1740
6    7  713  1628  651  1706  610  1741
7    8  713  1629  650  1707  609  1742
8    9  713  1630  649  1709  607  1744
9   10  713  1631  648  1710  605  1745
10  11  712  1632  646  1711  604  1746
Desired 33x3 output:
0 1 713 1622
1 2 714 1623
2 3 714 1624
3 4 714 1625
4 5 714 1626
5 6 713 1627
6 7 713 1628
7 8 713 1629
8 9 713 1630
9 10 713 1631
10 11 712 1632
11 1 658 1658
12 2 657 1700
13 3 656 1701
14 4 655 1702
15 5 654 1703
16 6 653 1705
17 7 651 1706
18 8 650 1707
19 9 649 1709
20 10 648 1710
21 11 646 1711
22 1 620 1734
23 2 618 1735
24 3 617 1736
25 4 615 1738
26 5 614 1739
27 6 612 1740
28 7 610 1741
29 8 609 1742
30 9 607 1744
31 10 605 1745
32 11 604 1746
I have spent an inordinate amount of time checking this, and I cannot find anything better than
pd.concat([df1, df2], ignore_index=True)
or
df1.append(df2, ignore_index=True)
which should produce identical results in this case. However, whichever option is used, it is going to be placed at the end of a loop that produces temporary DataFrames to be concatenated with, or appended to, the permanent DataFrame. The temp df's come out fine, but the allegedly straightforward concatenation step fails consistently: I get an empty DataFrame with a proper header...
for l in range(1, 13):
    s1 = l * 4 - 4
    s2 = l * 4
    dft = df0.iloc[:, s1:s2]
    dft.columns = new_col
    #pd.concat([df1, dft], ignore_index=True, axis=0)
    #df1.append(dft, ignore_index=True)
df1.head()
Either of the commented-out lines should produce a stack of 4-wide temp DataFrames, yet I get an empty DataFrame with a proper header and no error messages...
Solved by @Aryerez in a comment above:
Both pd.concat() and df.append() are not in place by default. See if df1 = pd.concat(etc...) solves it.
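A sketch of the corrected loop under the same assumptions (a wide df0 made of 4-column blocks and a list new_col of target names; the synthetic data below is only illustrative). The key change is assigning the result of pd.concat back to df1:
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's df0: three 4-wide blocks.
df0 = pd.DataFrame(np.arange(11 * 12).reshape(11, 12))
new_col = ['a', 'b', 'c', 'd']

df1 = pd.DataFrame(columns=new_col)
for l in range(1, 4):
    dft = df0.iloc[:, l * 4 - 4:l * 4].copy()
    dft.columns = new_col
    # pd.concat returns a new DataFrame; it must be assigned back
    df1 = pd.concat([df1, dft], ignore_index=True)
print(df1.shape)  # (33, 4)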
Having a dataframe such as:
myFrame = pd.DataFrame(np.random.randint(1000, size=[7, 4]),
                       index=[['GER', 'GER', 'GER', 'GER', 'FRA', 'FRA', 'FRA'],
                              ['Phone', 'Email', 'Chat', 'Other', 'Phone', 'Chat', 'Email']])
0 1 2 3
GER Phone 765 876 588 933
Email 819 364 42 73
Chat 954 665 317 918
Other 692 531 312 206
FRA Phone 272 261 426 270
Chat 158 172 347 902
Email 453 721 67 6
How could I easily add the missing index(es) of the inner level? E.g. you can see that GER has an "Other" index label. I'd like to add that "Other" to all countries and fill its values with e.g. 0. There might be a third outer index (e.g. ITA), for which yet another inner index could be found (e.g. SMS).
At the end, all countries should have exactly the same set of inner indexes.
Thanks!
Use reindex with a MultiIndex.from_product created from the unique values of each level, obtained via MultiIndex.get_level_values:
mux = pd.MultiIndex.from_product([myFrame.index.get_level_values(0).unique(),
myFrame.index.get_level_values(1).unique()])
print (myFrame.reindex(mux, fill_value=0))
0 1 2 3
GER Phone 250 614 226 777
Email 917 156 148 902
Chat 537 665 87 75
Other 431 203 921 572
FRA Phone 159 790 646 810
Email 294 205 949 726
Chat 209 895 128 282
Other 0 0 0 0
Another solution with unstack and stack - MultiIndex is sorted:
print (myFrame.unstack(fill_value=0).stack(dropna=False))
0 1 2 3
FRA Chat 209 895 128 282
Email 294 205 949 726
Other 0 0 0 0
Phone 159 790 646 810
GER Chat 537 665 87 75
Email 917 156 148 902
Other 431 203 921 572
Phone 250 614 226 777
To substitute the numbers with their corresponding "ranks":
import pandas as pd
import numpy as np
numbers = np.random.randint(low=0, high=10001, size=1000)
df = pd.DataFrame({'a': numbers})
df['a_rank'] = df['a'].rank()
I am getting float values as the default output type of the rank method:
987 82.0
988 36.5
989 526.0
990 219.0
991 957.0
992 819.5
993 787.5
994 513.0
Instead of floats I would rather have integers. Rounding the resulting float values using astype(int) would be risky, since converting to int would probably introduce duplicated values from float values that are too close to each other, such as 3.5 and 4.0: both would convert to the integer value 4.
Is there any way to guide rank method to output the integers?
The above solution did not work for me; the following did work, though. The critical line with edits is:
df['a_rank'] = df['a'].rank(method='dense').astype(int)
This could be a version issue.
Pass the param method='dense'; this will increase the ranks by 1 between groups. See the docs:
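For instance, a minimal sketch of the difference on a small series with ties (values are illustrative):
import pandas as pd

s = pd.Series([10, 20, 20, 30])
print(s.rank().tolist())                # [1.0, 2.5, 2.5, 4.0] - average rank for ties
print(s.rank(method='dense').tolist())  # [1.0, 2.0, 2.0, 3.0] - no gaps, safe to cast to int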
In [2]:
numbers = np.random.randint(low=0, high=10001, size=1000)
df = pd.DataFrame({'a': numbers})
df['a_rank'] = df['a'].rank(method='dense')
df
Out[2]:
a a_rank
0 1095 114
1 2514 248
2 500 53
3 6112 592
4 5582 533
5 851 91
6 2887 287
7 3798 366
8 4698 458
9 1699 170
10 4739 462
11 7199 693
12 817 88
13 3801 367
14 5584 534
15 4939 481
16 2569 258
17 6806 656
18 93 8
19 8574 816
20 4107 396
21 7086 684
22 6819 657
23 8844 847
24 170 15
25 6629 634
26 9905 950
27 5312 512
28 3794 365
29 9476 911
.. ... ...
970 4607 447
971 8430 801
972 6527 625
973 2794 280
974 4414 425
975 1069 111
976 2849 285
977 7955 759
978 5767 547
979 7767 742
980 2956 294
981 5847 554
982 1029 107
983 4967 485
984 256 25
985 5577 532
986 6866 662
987 5903 563
988 1785 181
989 749 78
990 2164 212
991 1074 112
992 8752 837
993 2737 272
994 2761 277
995 7355 705
996 8956 857
997 4831 473
998 222 21
999 9531 917
[1000 rows x 2 columns]
No need to use method='dense'; just convert to an integer.
df['a_rank'] = df['a'].rank().astype(int)