Python/Dataframes: Column count and names?

Sorry for this newbie question. Working in Python, I need to get the number of columns in the df as an integer, and for each of them I need the column name.
df.printSchema() displays a nice tree view
df.describe().show() displays some stats as well.
but I can't seem to find a way to get a count of the columns and an array of the column names. Perhaps I should do it using the SQL API, but I'm not quite familiar with that yet. Still learning the basics... Thanks so much!
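Since printSchema() is a PySpark method, here is a minimal sketch using the DataFrame API directly (no SQL needed); df is the dataframe from the question:
# df.columns returns a plain Python list of the column names,
# so the column count is just its length.
column_names = df.columns          # e.g. ['id', 'name', 'value']
column_count = len(df.columns)     # integer count of columns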

Related

How do I create a column with certain conditions built in (not the same as a conditional column) in python

I've attached a screenshot of a table in Excel, but I'm doing this in Python.
I'm trying to recreate the column "predict" in Python; I already have the other columns. I want the first row of "predict" to be equal to the first row of "ytd", and then every value after that to be the "nc" value multiplied by the previous value in the "predict" column. It doesn't have to be done in this particular order or in this way; I just want that end result, and any clear help to achieve it would be much appreciated. I feel like there should be a way to do this with conditionals, but I'm struggling to find the right combination of information.
Have you got any code in Python yet? Is the data already there, or are you reading it from the Excel file and then printing it out or saving it back to a file? I didn't quite understand the question.
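In case it helps, here is a minimal pandas sketch of the recurrence described above; the column names "ytd" and "nc" come from the question, while the sample values are made up:
import pandas as pd

# Hypothetical stand-in for the screenshot data; only the column names
# "ytd" and "nc" are taken from the question.
df = pd.DataFrame({"ytd": [100.0, 110.0, 120.0], "nc": [1.05, 1.02, 1.03]})

# predict[0] = ytd[0]; predict[i] = nc[i] * predict[i-1]
predict = [df["ytd"].iloc[0]]
for nc in df["nc"].iloc[1:]:
    predict.append(nc * predict[-1])
df["predict"] = predict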

How to find/filter/combine based on common prefix in rows and columns with use of python/pandas?

I'm new to coding and having a hard time expressing/searching for the correct terms to help me along with this task. In my work I get some pretty large Excel files from people out in the field monitoring birds. The results need to be prepared for databases, reports, tables and more. I was hoping to use Python to automate some of these tasks.
How can I use Python (pandas?) to find certain rows/columns based on a common name/ID with a unique suffix, and aggregate/sum the results that belong together under that common name? As an example, in the table provided I need to get all the results from sub-localities, e.g. AA3_f, AA3_lf and AA3_s, expressed as the sum (total of gulls for each species) of the subs in a new row for the main locality AA3.
Can someone please provide some code for this task, or help me in some other way? I have searched and watched many tutorials on Python, NumPy, pandas and also matplotlib... still clueless on how to set this up.
Any help appreciated. Thanks!
Update:
@Harsh Nagouda, thanks for your reply. I tried your example using the groupby function, but I'm having trouble dividing the data into the correct groups. The "Locality" column has only unique values/IDs because they all have a suffix (they are sub-categories).
I tried to solve this by slicing the strings:
eng.Locality.str.slice(0,4,1)
I managed to slice off the suffixes so that the remainders were AA3_, AA4_ and so on.
Then I tried to do this slicing inside the groupby function. That failed. Then I tried to slice using pandas.DataFrame.apply(). That failed as well.
eng["Locality"].apply(eng.Locality.str.slice(0,4,1))
sum = eng.groupby(["Locality"].str.slice(0,4,1)).sum()
Any more help out there? As you can see above - I need it :-)
In your case, the pd.groupby option seems to be a good fit for the problem. The groupby function does exactly what its name says: it groups the parts of the dataframe you'd like it to.
Since you mentioned a case based on grouping by localities and finding the sum of those values, this snippet should help you out:
sum = eng.groupby(["Locality"]).sum()
Additional commands and sorting styles can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
I finally figured out a way to get it done. Maybe not the smoothest way, but at least I get the end result I need:
Edited the Locality ID to remove the suffix:
eng["Locality"] = eng["Locality"].str.slice(0, 4, 1)
Used the groupby function:
sum = eng.groupby(["Locality"]).sum()
End result: [table screenshot]
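For the record, the same result can be had without overwriting the ID column, by grouping directly on the sliced key; a sketch using the same eng frame and naming as above:
# Group on the derived 4-character prefix (e.g. "AA3_") without
# modifying the "Locality" column itself.
totals = eng.groupby(eng["Locality"].str.slice(0, 4)).sum()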

Simple DataFrame question from a beginner

I'm learning dataframes now. I've been stuck on how to get a subset of a dataframe or table using its label index. I know it's a very simple question, but I couldn't find the solution in the pandas documentation. Hope someone can help me. Appreciate your help.
So, I have a dataframe named df_teams like below:
[screenshot of the df_teams dataframe]
If I want to get a subtable for a specific team, 'Warriors', I can use df_teams[df_teams['nickname']=='Warriors'], resulting in a row in the form of a dataframe. My question is: what if I want a subtable of more teams, say the information for both 'Warriors' and 'Hawks', to form a new table? Can I do something similar using a logical index, finishing in one line of code?
You could do a bitwise OR of the two conditions using the '|' operator.
df_teams[(df_teams['nickname']=='Warriors')|(df_teams['nickname']=='Hawks')]
Alternatively, if you have a list of values you want to check against, you could instead use the isin method to return rows that have one of the values present in the list.
E.g.
df_teams[df_teams['nickname'].isin(['Warriors','Hawks'])]

Pandas and complicated filtering and merge/join multiple sub-data frames

I have a seemingly complicated problem and a general idea of how I should solve it, but I am not sure it is the best way to go about it. I'll give the scenario and would appreciate any help on how to break it down. I'm fairly new to Pandas, so please excuse my ignorance.
The Scenario
I have a CSV file that I import as a dataframe. The example I am working through contains 2742 rows × 136 columns. The rows are variable but the columns are set. I have a set of 23 lookup tables (also CSV files), named per year and quarter (the range is 2020 3rd quarter through 2015 1st quarter). The lookup files are named like this: PPRRVU203.csv, which contains values from the 3rd quarter of 2020. The lookup tables are matched on two columns ('Code' and 'Mod'), and I use three values associated with each match in the lookup.
I am trying to filter sections of my data frame, pull the correct values from the matching lookup file, merge back into the original subset, and then replace into the original dataframe.
Thoughts
I can probably abstract this and wrap it in a function, but I'm not sure how to place the results back in. My question, for those who understand Pandas better than I do: what is the best method to filter, replace the values, and write the file back out?
The straightforward solution would be to filter the original dataframe into 23 separate dataframes, do the merge on each individual one, then concat them into a new dataframe and output to CSV.
This seems highly inefficient?
I can post code, but I am looking more for high-level thoughts.
Not sure exactly what your DataFrame looks like, but the Pandas query() method may prove useful for the selection of data.
name = df.query('columnname == "something"')
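For the lookup step itself, a plain two-key merge per quarter, followed by a concat, matches the straightforward approach described above. A minimal sketch: 'Code', 'Mod' and the file name come from the question, while the 'quarter' column used to split the frame is an assumption:
import pandas as pd

# One quarter's lookup file, named as in the question (3rd quarter of 2020).
lookup = pd.read_csv("PPRRVU203.csv")

# Hypothetical "quarter" column used to pick the rows this lookup applies to.
subset = df[df["quarter"] == "2020Q3"]

# Match on the two key columns; a left join keeps rows with no lookup hit.
merged = subset.merge(lookup, on=["Code", "Mod"], how="left")

# Repeating this per quarter and concatenating rebuilds the full frame:
# result = pd.concat(merged_quarters, ignore_index=True)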

unpack dictionary values in dataframe python

I have this dataframe from which I need to extract the act1omschr from the column adresactiviteit; however, since it is an object with a list and dict, I don't know how to extract these values.
Can someone help me out?
It looks like that's not a dictionary but JSON (JavaScript Object Notation). It's a bit like a CSV but with nested values, and pretty common, especially for web data.
Pandas has a function called json_normalize which should help. For using it specifically on one column, this was answered pretty well over here. You should more or less be able to use the exact code given.
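A minimal sketch of that approach, assuming each cell of adresactiviteit holds a list of dicts with an act1omschr key (as the screenshot suggests):
import pandas as pd

# Give each dict in the list its own row, then flatten the dicts to columns.
exploded = df.explode("adresactiviteit").dropna(subset=["adresactiviteit"])
flat = pd.json_normalize(exploded["adresactiviteit"].tolist())

# The column of interest:
act1 = flat["act1omschr"]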
