Sorting a file by a specific field


I have a file that has the following format:
12345 TAB_HERE Name : The Actual Name TAB_HERE 6785
eg.
1001020 Name : SMITH S ANNALOLA 14570
5701061 Name : MATTHEW SANDY HILL 6440
7001083 Name : TANYA MORRISON MILLER 14406
I want to sort by the last field of numbers.
I'd prefer a simple one line python solution or a linux tool based solution.
I tried using sort -k 3,3n but it did not work.
And I can't seem to write a single line python code that I can run as python -c "code here"
I looked at the following but to no avail:
http://www.unix.com/unix-dummies-questions-answers/18359-how-do-i-specify-tab-field-separator-sort.html
http://www.unix.com/unix-dummies-questions-answers/30450-sort-third-column-n-command.html
http://www.linuxquestions.org/questions/programming-9/unix-sort-on-multiple-fields-598813/

Quick solution:
import sys
# Sort stdin numerically by the last whitespace-separated field; write() avoids an extra trailing newline
sys.stdout.write("".join(sorted(sys.stdin.readlines(), key=lambda x: int(x.split()[-1]))))
This solution has some disadvantages. For example, it will not work if some lines have no number in the last field, or if you want to sort the data by something other than the last field. In that case you must use regular expressions (the re module) and describe the field you want to sort on in the key function.
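To illustrate the first disadvantage mentioned above, here is a minimal sketch that tolerates lines whose last field is not a number. It assumes the fields are tab-separated, as described in the question; the names last_field_key and sort_by_last_field are made up for this example.

```python
def last_field_key(line):
    # Take the text after the last tab and try to read it as an integer.
    field = line.rstrip("\n").split("\t")[-1].strip()
    try:
        return (0, int(field))
    except ValueError:
        # Lines without a numeric last field sort after all numeric ones.
        return (1, 0)


def sort_by_last_field(lines):
    # sorted() is stable, so non-numeric lines keep their original order.
    return sorted(lines, key=last_field_key)


sample = [
    "1001020\tName : SMITH S ANNALOLA\t14570\n",
    "broken line without a number\n",
    "5701061\tName : MATTHEW SANDY HILL\t6440\n",
    "7001083\tName : TANYA MORRISON MILLER\t14406\n",
]
sorted_lines = sort_by_last_field(sample)
print("".join(sorted_lines), end="")
```

Whether bad lines should go last, go first, or raise an error is a design choice; this sketch simply pushes them to the end.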

Python one liner:
cat file | python -c 'import sys; sys.stdout.write("".join(sorted(sys.stdin.readlines(), key=lambda x:int(x.split()[-1]))))'
My guess is that the other Python example does not work as a one-liner because it uses " both to quote the code on the command line and to invoke join() inside the code...

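I guess the --key parameter for the sort command counts whitespace-separated fields, so in the sample lines the number is field 7.
sort -k7n
worked for me. Note that this only holds when every name has the same number of words.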

Related

Variable (that contains a space) in a Python subprocess command

I am having a problem using a path variable (that contains a space) in a Python subprocess command.
This should be so simple, yet I have wasted almost 3 hours trying to get the full value to work?
HELP!

Can you change repo to:
repo='/Users/derekm/"BGGoPlan Home"/"99.0 Repo"/Response/response-dashboard'
That should fix your issue.
If you'd like to programmatically solve this issue then I recommend doing the following:
repo_loc = repo_loc.replace(" ", "\\ ")
Just before your sp2 = ... line.
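Another approach, not taken from the answer above, is to pass the command as a list so that no shell is involved and the space needs no escaping at all. The question's original command is not shown, so the sketch below uses a hypothetical child command (a small Python snippet that lists a directory) and a temporary directory whose name contains spaces.

```python
import os
import subprocess
import sys
import tempfile


def list_dir(path):
    # Each list element reaches the child process as one argument,
    # so the spaces inside `path` are never split or reinterpreted.
    result = subprocess.run(
        [sys.executable, "-c",
         "import os, sys; print(sorted(os.listdir(sys.argv[1])))",
         path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Hypothetical directory layout, only to have a path with spaces in it.
base = tempfile.mkdtemp()
repo = os.path.join(base, "BGGoPlan Home", "99.0 Repo")
os.makedirs(repo)
open(os.path.join(repo, "README"), "w").close()

listing = list_dir(repo)
print(listing)
```

Quoting or backslash-escaping is only needed when the command is a single string run through a shell (shell=True).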

Formatting a rpm query output with a separator

I am trying to get a list of all packages that are installed on my system. For this I call 'rpm -qai' from within a Python-script where further transformations on the output take place.
I kind of ran into the problem now that the output of above query does not separate the different packages. This looks something like this:
$ rpm -qai
Name : PackageName
Version : 1.0
...
LastEntry: Something
Name : NextPackageName
Version : 1.1
...
What I want is something along the line of
Name : PackageName
Version : 1.0
...
LastEntry: Something
//empty line or some other kind of separator
Name : NextPackageName
Version : 1.1
...
My script reads everything line by line and saves the lines in a dictionary. My current workaround checks whether the current line starts with 'Name'; if so, it appends the dictionary to a list and clears the dictionary. This step is skipped for the very first line.
This solution is pretty ugly. Unfortunately, a fixed number of lines does not work as not all packages provide the same amount of information.
I also thought about running 'rpm -qai' first, retrieving a list of all package names from this, then iterating over the list while calling 'rpm -qi current_item'. Then one could grab the output from each single query. But since this requires two runs, I deem it unnecessary extra work.
So, does RPM (or some other tool) provide a feature which would allow the desired output?

There are python bindings for "proper" RPMDB interfacing instead of parsing "rpm" output. Think of it as git's porcelain vs plumbing. In fact, yum is all python (last time I checked). I think that will be better for you in the long run.
This documentation could be a good start.

You can use the rpm --qf|--queryformat flag with a format string to set the output format.
For example, you can use rpm -qa --qf "%{NAME} %{VERSION}\n" to get just the fields you are interested in for every package, separated as you want.
Or, in your case, you can use something like rpm -qai --qf "\n####\n". You will get all fields for every package, with the separator you chose between them. Note that the Description field may contain multiline text, so using \n alone as a separator may be wrong.
You can read about this in more detail in man rpm.
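As a rough sketch of how the query format could be consumed from Python: the format string below prints name, version and description for each package, followed by the #### separator line from the answer. The function names and the choice of fields are assumptions for illustration; the sketch also assumes no description contains a line consisting only of ####.

```python
import subprocess

# Raw string: rpm itself interprets the \n escapes in the format.
QUERY_FORMAT = r"%{NAME}\n%{VERSION}\n%{DESCRIPTION}\n####\n"
RECORD_SEPARATOR = "\n####\n"


def parse_records(output):
    """Turn the separator-delimited rpm output into a list of dicts."""
    packages = []
    for record in output.split(RECORD_SEPARATOR):
        if not record.strip():
            continue  # skip the empty piece after the final separator
        # Only split twice, so a multiline description stays in one piece.
        name, version, description = record.split("\n", 2)
        packages.append(
            {"Name": name, "Version": version, "Description": description}
        )
    return packages


def query_installed_packages():
    # Requires the rpm command; not executed in this sketch.
    output = subprocess.run(
        ["rpm", "-qa", "--qf", QUERY_FORMAT],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_records(output)


# Made-up output in the shape the format string above would produce.
example = "foo\n1.0\nline one\nline two\n####\nbar\n1.1\nshort\n####\n"
parsed = parse_records(example)
print(parsed)
```

Putting the multiline field last and splitting a fixed number of times is what keeps the description from being mistaken for further fields.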

Naming DataFrames iteratively using items from a List + a string

I have a list of names of countries, and I have a large dataframe where one of the columns is ' COUNTRY ' (yes, it has a space before and after the word country). I want to create smaller DataFrames based on country names.
cleaned_df[cleaned_df[' COUNTRY ']==asia_country_list[1]]
seems too long a command to achieve this? It does work though.
Now,
str("%s_data" % (asia_country_list[1]))
gives
'Taiwan_data'
but when I combine the above two:
str("%s_data" % (asia_country_list[1])) = cleaned_df[cleaned_df[' COUNTRY ']==asia_country_list[1]]
I get:
SyntaxError: can't assign to function call
happy to learn other ways as well to achieve this pls.. Thanks vm

I don't think you should do this, but if you really need it :
exec(str("%s_data" % (asia_country_list[1])) +"= cleaned_df[cleaned_df[' COUNTRY ']==asia_country_list[1]]")
should work.
Using a dictionary is likely to solve your problem
D={}
D["%s_data" % asia_country_list[1]] = cleaned_df[cleaned_df[' COUNTRY ']==asia_country_list[1]]
EDIT : the first solution is a bad idea. exec is dangerous: if one of the values is something like "del cleaned_df", it will actually be executed, which can be destructive. I am guessing spaces in the names will also be a problem in your case, since they do not form valid variable names. It's a bit like SQL injection...
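The dictionary idea extends to the whole list at once. The following sketch assumes pandas is installed and uses a small made-up cleaned_df and asia_country_list in place of the real data; only the ' COUNTRY ' column name comes from the question.

```python
import pandas as pd

# Made-up stand-ins for the data described in the question.
cleaned_df = pd.DataFrame(
    {
        " COUNTRY ": ["Japan", "Taiwan", "Taiwan", "France"],
        "VALUE": [1, 2, 3, 4],
    }
)
asia_country_list = ["Japan", "Taiwan"]

# Keep only the rows for the listed countries, then split by country.
asia_rows = cleaned_df[cleaned_df[" COUNTRY "].isin(asia_country_list)]
frames = {
    "%s_data" % country: group
    for country, group in asia_rows.groupby(" COUNTRY ")
}

print(sorted(frames))
print(frames["Taiwan_data"])
```

The smaller DataFrames are then reached as frames["Taiwan_data"] rather than through generated variable names, so no exec is needed.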

Parsing specific keywords in Select Statements and formatting

I have a sample select statement:
Select D.account_csn, D.account_key, D.industry_id, I.industry_group_nm, I.industry_segment_nm From ecs.DARN_INDUSTRY I JOIN ecs.DARN_ACCOUNT D
ON I.SRC_ID=D.INDUSTRY_ID
WHERE D.ACCOUNT_CSN='5070000240'
I would like to parse the select statements into separate files. The first file name is called ecs.DARN_INDUSTRY
and inside the file it should look like this:
industry_group_nm
industry_segment_nm
Similarly another file called ecs.DARN_ACCOUNT and the content looks like this:
account_csn
account_key
industry_id
How do I do this in Bash or Python??

I doubt you will find a truly simple answer (maybe someone can prove otherwise). However, you might find python-sqlparse useful.
Parsing general SQL statements will be complicated and it is difficult to guess exactly what you are trying to accomplish. However, I think you are trying to extract the tables and corresponding column references via sql parsing, in which case, look at this question which basically asks that very thing directly.

Here is a long working command through awk,
awk 'NR==1{gsub(/^.*\./,"",$5);gsub(/^.*\./,"",$6);gsub(/.$/,"",$5); printf $5"\n"$6"\n" > "DARN_INDUSTRY"; gsub(/^.*\./,"",$2);gsub(/^.*\./,"",$3);gsub(/^.*\./,"",$4);gsub(/.$/,"",$2);gsub(/.$/,"",$3);gsub(/.$/,"",$4); printf $2"\n"$3"\n"$4"\n" > "DARN_ACCOUNT"}' file
Explanation:
gsub(/^.*\./,"",$5) removes all characters up to and including the last . symbol in column number 5 (the match is greedy; here each column contains only one dot).
printf $5"\n"$6"\n" > "DARN_INDUSTRY" redirects the output of printf command to the file named DARN_INDUSTRY.
gsub(/.$/,"",$4) Removes the last character in column 4.
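For comparison, here is a Python sketch of the same idea that looks up the table aliases instead of relying on column positions. It is not a general SQL parser: it assumes a single SELECT whose columns are all written as alias.column and whose tables are all given as "table alias" after FROM or JOIN, as in the sample. The function name columns_by_table is made up for this example.

```python
import re
from collections import defaultdict


def columns_by_table(sql):
    # Separate the select list from everything after the first FROM.
    match = re.search(
        r"select\s+(.*?)\s+from\s+(.*)", sql, re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError("not a simple SELECT ... FROM statement")
    select_list, rest = match.groups()

    # Map each alias to its table, e.g. "I" -> "ecs.DARN_INDUSTRY".
    # "from " is put back so the first table matches the same pattern.
    aliases = {
        alias: table
        for table, alias in re.findall(
            r"(?:from|join)\s+([\w.]+)\s+(\w+)", "from " + rest, re.IGNORECASE
        )
    }

    # Group the selected columns under the table their alias points to.
    result = defaultdict(list)
    for item in select_list.split(","):
        alias, _, column = item.strip().partition(".")
        result[aliases[alias]].append(column)
    return dict(result)


sql = """Select D.account_csn, D.account_key, D.industry_id, I.industry_group_nm, I.industry_segment_nm From ecs.DARN_INDUSTRY I JOIN ecs.DARN_ACCOUNT D
ON I.SRC_ID=D.INDUSTRY_ID
WHERE D.ACCOUNT_CSN='5070000240'"""

tables = columns_by_table(sql)
for table, columns in tables.items():
    print(table, columns)
```

To produce the files asked for in the question, each entry could be written out with open(table, "w") and "\n".join(columns).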

CSV credential python one liner

I have a csv file like this :
name,username
name2,username2
etc...
And I need to extract each column into lists so I can create an account (admin script).
I am hoping the result would look like this :
NAMES=( name name2 )
MAILS=( username username2 )
LENGHT=3 # number of lines in csv files actually
I would like to do it in python (because I use it elsewhere in my script and would like to convert my colleagues to the dark side). Except that I am not really a python user...
Something like this would do the trick (I assume) :
NAMES=( $(echo "$csv" | pythonFooMagic) )
MAILS=( $(echo "$csv" | python -c "import sys,csv; pythonFooMagic2") )
LENGHT=$(echo "$csv" | pythonFooMagic3)
I found tutorials that do it across several lines, but glued together it was ugly.
There must be some cool ways to do it. Else I will resign to use sed... Any ideas?
EDIT : ok bad idea, for future reference, see the comments

You could use a temporary file, like this:
tmpfile=$(mktemp)
# Python script produces the variables you want
pythonFooMagic < csv > $tmpfile
# Here you take the variables somehow. For example...
source "$tmpfile"
rm $tmpfile
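The Python side of that temporary-file approach could look roughly like the sketch below, which turns the CSV text into bash assignments suitable for source. The function name csv_to_bash is made up; the variable names NAMES, MAILS and LENGHT (spelled as in the question) are kept so the output matches what was asked for. Values are passed through shlex.quote so that spaces or shell metacharacters in the CSV are not interpreted by bash.

```python
import csv
import io
import shlex


def csv_to_bash(text):
    # Parse the CSV text and ignore completely empty lines.
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    # Quote every value so it is safe inside a bash array literal.
    names = " ".join(shlex.quote(row[0]) for row in rows)
    mails = " ".join(shlex.quote(row[1]) for row in rows)
    return "NAMES=( {} )\nMAILS=( {} )\nLENGHT={}\n".format(
        names, mails, len(rows)
    )


assignments = csv_to_bash("name,username\nname2,username2\n")
print(assignments, end="")
```

In a real script the text would come from sys.stdin.read() instead of a literal string.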
