I am working on a program to go through tweets and predict whether the author falls into one of two categories. I want to use get_dummies to mark whether or not a tweet contains any of the top 10 hashtags, or whether it contains 'other'. (In the end I will probably be using the top 500 or so hashtags, not just 10; the data set is over 500,000 columns in total with over 50,000 unique hashtags.)
This is my first time using pandas, so apologies if my question is unclear, but I think what I'm expecting is that the data set would gain one new column per top hashtag, and the value of that [row][column] pair would be 1 if the row contains that hashtag or 0 if it does not. There would also be a column for 'other', indicating the row has hashtags not in the top 10.
I know how to determine the most frequently occurring hashtags in the column already:
counts = df.hashtags.value_counts()
counts.nlargest(10)
I also understand how to get dummies; I just don't know how to avoid making one for every single hashtag.
dummies = pd.get_dummies(df, columns=['hashtags'])
Please let me know if I could be clearer or provide more info. Appreciate the help!
Don't have time to generate data and work it all out, but thought I'd give you this idea in case it might help you out.
The idea is to leverage .isin() to get the values that you need to build the dummies. Then leverage the power of the index to match to the source rows.
Something like:
pd.get_dummies(df.loc[df['hashtags'].isin(counts.nlargest(10).index)], columns=['hashtags'])
You will have to see if the indices will give you what you need.
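Building on that, here is a minimal sketch of one way to get a dummy column per top hashtag plus an 'other' column. The example data and the top-2 cutoff are made up, and it assumes the hashtags column holds one hashtag per row:

import pandas as pd

# Hypothetical example: one hashtag string per row in 'hashtags'
df = pd.DataFrame({'tweet_id': range(6),
                   'hashtags': ['#a', '#b', '#a', '#c', '#d', '#a']})

counts = df['hashtags'].value_counts()
top = counts.nlargest(2).index  # use 10 (or 500) on the real data

# Collapse everything outside the top N into a single 'other' bucket,
# then build one dummy column per remaining value.
df['hashtags'] = df['hashtags'].where(df['hashtags'].isin(top), 'other')
dummies = pd.get_dummies(df, columns=['hashtags'])
print(dummies)

Unlike the .loc/.isin filter above, this keeps every row and simply relabels the rare hashtags as 'other', which is what the question's extra column describes.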
I have a DataFrame like this: DATASET
I want to separate my data set into four parts.
Each of the four parts corresponds to a title.
The difficulty is that each of these four parts of the table has a different size. Moreover, I don't have only one table but many others (with parts of different sizes, but still four parts).
The goal is to find a way to separate my table by referring to the TITLE each time. Do you have an idea of how to do it?
Sincerely,
Etienne
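In case it helps, here is a minimal sketch under one possible assumption about the layout (the linked DATASET image is not reproduced here, so the column name and titles below are made up): if the four titles appear as marker rows in a single column, the parts can be numbered with a cumulative sum of the title rows and grouped on it.

import pandas as pd

# Hypothetical layout: titles sit in the first column as marker rows and every
# row below a title belongs to that part until the next title appears.
titles = ['TITLE 1', 'TITLE 2', 'TITLE 3', 'TITLE 4']
df = pd.DataFrame({'col': ['TITLE 1', 'a', 'b', 'TITLE 2', 'c',
                           'TITLE 3', 'd', 'e', 'f', 'TITLE 4', 'g']})

# Each title row increments the counter, so the parts can have any size.
part_id = df['col'].isin(titles).cumsum()
parts = {i: grp[~grp['col'].isin(titles)] for i, grp in df.groupby(part_id)}
print(parts[1])  # rows belonging to the first title

The same loop works unchanged for the other tables, since it never assumes a fixed part size.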
I have put ~100 dataframes containing data into a list tables, along with a list of names (so I can call a table by name or just iterate over the whole bunch without needing names).
This data will need to be stored, appended to, and later queried, so I want to store it as a pandas HDF5 store.
There are ~100 DFs but I can group them into pairs (two different observers).
In the end I want to iterate over the whole list of tables, but also be able to work with each observer's tables separately.
I've thought about Panels (but that will have annoying NaN values since the tables aren't the same length), hierarchical HDF5 (but that doesn't really solve anything, it just groups by observer), and one continuous DataFrame, seeing as they have the same number of columns (but that will just make it harder because I'll have to piece the DFs back together afterwards).
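For concreteness, a minimal sketch of the hierarchical-HDFStore option mentioned above (the observer and table names are made up):

import pandas as pd

# Hypothetical keys: one group per observer, one node per table
frames = {'obs_a/run1': pd.DataFrame({'x': [1, 2, 3]}),
          'obs_a/run2': pd.DataFrame({'x': [4, 5]}),
          'obs_b/run1': pd.DataFrame({'x': [6, 7, 8, 9]})}

with pd.HDFStore('observations.h5') as store:
    for key, df in frames.items():
        store.put(key, df, format='table')  # 'table' format stays appendable and queryable

    for key in store.keys():                # iterate over everything...
        print(key, len(store[key]))
    obs_a_keys = [k for k in store.keys() if k.startswith('/obs_a')]  # ...or one observer's group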
Is there anything blatantly obvious I'm missing, or am I just going to have to grin and bear it with one of these? (If so, which one would you go for to give the greatest flexibility?)
Thanks
I have a column in a CSV file that contains names, where each cell in that column could be a slightly misspelled version of another cell. For example, "Nike" could be the same as "Nike inc." could be the same as "Nike Inc".
My Current Script
I've already written a program in Python that removes prefixes and suffixes from each cell if that value occurs more than 2 times in the column as a prefix or suffix. I then compared each row to the next after sorting the column alphabetically.
My Current Problem
There are still many cells that are in reality duplicates of other cells, but they are not indicated as such. These cells are:
a) Not exact matches (and not off by just capitalization)
b) Not caught by comparing their stem (without prefix and without suffix) to their alphabetical neighbor
My Current Questions
1) Does anyone have experience mapping IDs to names from all over the world (so accents, Unicode and all that stuff are an issue here too, although I managed to solve most of these Unicode issues) and good ideas for algorithm development that are not listed here?
2) In some of the cases where duplicates are not picked up, I can see why I know they are duplicates. In one instance there is a period in the middle of a line that is not present in its period-free brother cell. Is one good strategy simply to create an extra column and output the cell values that I suspect of being duplicates, based on the few instances where I can see the reason?
3) How do I check myself? One way is to flag the maximum number of potential duplicates and look over all of them manually. Unfortunately, the size of our dataset doesn't make that very pretty, nor very feasible...
Thanks for any help you can provide!
Try transliterating the names to remove the international symbols, then consider using a function like Soundex or Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance, e.g. http://pypi.python.org/pypi/Fuzzy) to calculate text similarity.
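As a rough illustration of that idea, here is a minimal standard-library sketch: unicodedata stands in for a real transliteration step, and difflib's ratio stands in for Levenshtein or Soundex; the names and the 0.85 threshold are made up.

import difflib
import unicodedata

def normalize(name):
    # Strip accents, punctuation and case so 'Nike Inc.' and 'nike inc' compare equal
    ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    cleaned = ''.join(ch for ch in ascii_name.lower() if ch.isalnum() or ch.isspace())
    return ' '.join(cleaned.split())

def similarity(a, b):
    # 1.0 means identical after normalization; lower means further apart
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

names = ['Nike', 'Nike inc.', 'Nike Inc', 'Nikй Inc'.replace('й', 'é')]
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if similarity(a, b) > 0.85:  # arbitrary threshold, tune on the real data
            print('possible duplicates:', repr(a), '~', repr(b))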
I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into in all three approaches: I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200], [300, 400]] you would retrieve values by asking for the value at an address array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
Why not use a single big table to keep all 500 million entries? If you use on-the-fly compression (the Blosc compressor is recommended here), most of the duplicated entries will be deduplicated, so the storage overhead is kept to a minimum. I'd recommend giving this a try; sometimes the simple solution works best ;-)
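To make that concrete, here is a minimal sketch of such a single Blosc-compressed PyTables table; the column names, string widths and file name are assumptions:

import tables

class Result(tables.IsDescription):
    # One row per (stringArg1..4, intArg1) combination that actually exists
    arg1 = tables.StringCol(32)
    arg2 = tables.StringCol(32)
    arg3 = tables.StringCol(16)
    arg4 = tables.StringCol(16)
    arg5 = tables.Int32Col()
    results = tables.Float64Col(shape=(12,))  # the 12 float results

filters = tables.Filters(complevel=5, complib='blosc')  # on-the-fly Blosc compression
with tables.open_file('simulations.h5', mode='w') as h5:
    tbl = h5.create_table('/', 'results', Result, filters=filters)
    row = tbl.row
    row['arg1'], row['arg2'], row['arg3'], row['arg4'] = 'a', 'b', 'c', 'd'
    row['arg5'] = 42
    row['results'] = [0.0] * 12
    row.append()
    tbl.flush()

Queries can then use tbl.read_where() conditions on the argument columns, and the compression keeps the repeated string arguments cheap on disk.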
Is there a reason the basic 6 table approach doesn't apply?
i.e. Tables 1-5 would be single-column tables defining the valid values for each of the fields, and then the final table would be a 5-column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string arguments as you describe, the 6th table could consist of just 3 columns (string1, string2, int1) and you could generate the combinations with string3 and string4 dynamically via a Cartesian join.
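If it helps, here is a minimal sqlite3 sketch of that 6-table layout; the table and column names are made up, and the result columns are trimmed to three for brevity:

import sqlite3

con = sqlite3.connect(':memory:')  # use a file path for persistent storage
con.executescript("""
CREATE TABLE arg1 (value TEXT PRIMARY KEY);
CREATE TABLE arg2 (value TEXT PRIMARY KEY);
CREATE TABLE arg3 (value TEXT PRIMARY KEY);
CREATE TABLE arg4 (value TEXT PRIMARY KEY);
CREATE TABLE arg5 (value INTEGER PRIMARY KEY);

-- The fact table only stores combinations that actually exist.
CREATE TABLE results (
    arg1 TEXT REFERENCES arg1(value),
    arg2 TEXT REFERENCES arg2(value),
    arg3 TEXT REFERENCES arg3(value),
    arg4 TEXT REFERENCES arg4(value),
    arg5 INTEGER REFERENCES arg5(value),
    r01 REAL, r02 REAL, r03 REAL,
    PRIMARY KEY (arg1, arg2, arg3, arg4, arg5)
);
""")

con.execute("INSERT INTO results VALUES ('a', 'b', 'c', 'd', 42, 0.1, 0.2, 0.3)")
print(con.execute("SELECT * FROM results WHERE arg1 = 'a'").fetchall())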
I'm not entirely sure what you're trying to do here, but it looks like you're trying to create a (potentially) sparse multidimensional array. So I won't go into details for solving your specific problem, but the best package I know of that deals with this is NumPy. NumPy can
be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
I've used NumPy many times for simulation data processing and it provides many useful tools, including easy file storage/access.
Hopefully you'll find something in its very easy-to-read documentation:
Numpy Documentation with Examples
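For example, a sparse set of results can be kept in a single NumPy structured array and saved/reloaded in one call; the field names and sizes below are assumptions:

import numpy as np

# Hypothetical record dtype: the five arguments plus the 12 float results
dtype = np.dtype([('arg1', 'U32'), ('arg2', 'U32'), ('arg3', 'U16'),
                  ('arg4', 'U16'), ('arg5', np.int32),
                  ('results', np.float64, (12,))])

data = np.zeros(2, dtype=dtype)  # one record per combination that exists
data[0] = ('a', 'b', 'c', 'd', 42, np.arange(12, dtype=float))

np.save('simulations.npy', data)      # easy file storage...
loaded = np.load('simulations.npy')   # ...and reload for later analysis
mask = (loaded['arg1'] == 'a') & (loaded['arg5'] == 42)  # simple boolean-mask queries
print(loaded[mask]['results'])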