Python - Reading files from Directory without single quotes

I am wondering if there is a way to read in files from your directory without single quotes around their names. My workspace has several GeoDataFrames sharing the same prefix. I can read them all in and append their names to a list, but the list contains each DataFrame name as a string, with single quotes around it.
dir()
'sub_geo_group_mangrove',
'sub_geo_group_marsh',
'sub_geo_group_multipolygon',
'sub_geo_group_pan',
'sub_geo_group_reedbed'
I want to use these objects inside a loop, and therefore need the names to resolve to the DataFrames themselves, without the single quotes.
Edit
The DataFrames are all formatted as follows:
sub_geo_group_mangrove.head(2)
geometry wetland
43 MULTIPOLYGON (((12.41310 -6.10921, 12.41274 -6... mangrove
59 POLYGON ((12.30671 -6.15770, 12.30654 -6.15741... mangrove
And I am attempting to read each DataFrame into the following lines:

I am fairly sure that what you are trying to achieve here should not really be necessary, but I will answer what you asked, based on what I understand about your requirements.
Since you are trying to use the DataFrame names in a loop, I am assuming that the DataFrames are available in the loop's scope, which is what you have shown with the dir() call in your example.
From the Python docs:
dir(...)
dir([object]) -> list of strings
If called without an argument, return the names in the current scope.
Now let's assume that the list containing your DataFrame names is:
dfnames = ['sub_geo_group_mangrove','sub_geo_group_marsh','sub_geo_group_multipolygon',...]
Based on my understanding of the problem you mentioned: to loop over them, you are not able to do something like:
for df in dfnames:
    # do something with this df
because type(df) would be str.
At this point you can do a couple of things. The most practical, I think, would be:
for name in dfnames:
    df = vars()[name]
    # do something with this df
vars() without an argument behaves like locals(): it returns a dictionary of the variables in your current scope. The keys of this dictionary are the variable names as strings, and the values are the objects bound to those names.
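To make the pattern concrete, here is a minimal, self-contained sketch; the sub_geo_group_* frames below are plain placeholder DataFrames standing in for the asker's GeoDataFrames:
import pandas as pd

# Placeholder frames standing in for the real GeoDataFrames (hypothetical data)
sub_geo_group_mangrove = pd.DataFrame({"wetland": ["mangrove"]})
sub_geo_group_marsh = pd.DataFrame({"wetland": ["marsh"]})

# Collect the names of every object in scope that matches the common prefix
dfnames = [name for name in dir() if name.startswith("sub_geo_group_")]

for name in dfnames:
    df = vars()[name]  # resolve the string name to the actual object
    print(name, df["wetland"].iloc[0])
Note that this works cleanly at module scope; inside a function, vars() reflects that function's locals, so module-level frames would not be found there.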

Related

Unexpected output when looping through subdictionaries in Python

I have a nested dictionary, where I have tickers to identify certain assets, and for each of these assets I would like to store characteristics in a subdictionary, creating them in a simple loop like the one below:
ticker = ["a","bb","ccc"]
ticker_dict = dict.fromkeys(ticker, {"Var":[]})
for key in ticker_dict:
    ticker_dict[key]["Var"] = len(key)
From the above code I would expect that, for each ticker/asset, it saves the "Var" entry as the length of that asset's name, meaning the following:
{"a":{"Var":1},
"bb":{"Var":2},
"ccc":{"Var":3}}
But, rather weirdly in my view, the result is this:
{"a":{"Var":3},
"bb":{"Var":3},
"ccc":{"Var":3}}
To provide further context, the real process is that I have four assets, for which I would like to store DataFrames in their subdictionaries, as this makes it easy for me to access them later in loops etc. Somehow, though, the data from the last asset is simply copied across all assets, even though I explicitly loop through different keys.
What's going on?
PS: I'm not sure how to explain the problem without the sample code, so I might have missed a similar entry on this site. If so, any pointers to it would of course be appreciated as well.
In your code, {"Var":[]} is evaluated only once, so there is just one inner dictionary, shared by all keys. Instead, you can use a dictionary comprehension:
ticker_dict = {key:{"Var":[]} for key in ticker}
and it will work as expected.
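A short, self-contained sketch of both behaviours, runnable as-is:
ticker = ["a", "bb", "ccc"]

# dict.fromkeys: the single {"Var": []} object is shared by every key
shared = dict.fromkeys(ticker, {"Var": []})
shared["a"]["Var"] = 1
print(shared["ccc"]["Var"])   # prints 1 -- every key sees the same inner dict

# dict comprehension: {"Var": []} is evaluated once per key
separate = {key: {"Var": []} for key in ticker}
for key in separate:
    separate[key]["Var"] = len(key)
print(separate)               # {'a': {'Var': 1}, 'bb': {'Var': 2}, 'ccc': {'Var': 3}}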

Apply function to each element of multiple lists; return differently named dataframes

I have a function that returns specific country-currency pairs that are used in the following step.
The return is something like:
lst_dolar = ['USA_dolar','Canada_dolar','Australia_dolar']
lst_eur = ['France_euro','Germany_euro','Italy_euro']
lst_pound=['England_pound','Scotland_pound','Wales_pound']
I then use a function that returns a dataframe.
One of the parameters of this function is the country-currency pair and the other is the period, taken from a list of periods:
period_lst = ['1y','2y','3y','4y','5y']
What I would like to do is then get a collection of dataframes, each of which will be saved to its own table using SQLite3.
My question is: how do I apply my function to each element of the lists of country-currency pairs and to each element of period_lst, and then obtain differently named dataframes as a result?
Ex: USA_dolar_1y
I would then like to be able to take each one of these dataframes and save it to a table, in a database, with the same name as the dataframe.
Thank you!
Whenever you think you need to dynamically name variables in Python, you probably want a dictionary:
def my_func(df, period):
    # do something with period and the dataframe and return the result
    return df

period_lst = ['1y', '2y', '3y', '4y', '5y']
usa_dollar = {}
for p in period_lst:
    usa_dollar[p] = my_func(df, p)
You can then access the various resulting dataframes (or whatever your function returns) by their period:
use_data(usa_dollar['3y'])
By the way: don't use capitals in your variable names; reserve CamelCase for class names and write function and variable names in lowercase, separated by underscores for readability. So usa_dollar, not USAdollar, for example.
This helps editors spot problems in your code and makes the code easier to read for other programmers, as well as future you. Look up PEP8 for more of these style rules.
Another by the way: if the only reason you want to keep the resulting dataframes in separate variables is to write them to a file, you could just write each dataframe to its file as soon as you've created it, and then reuse the variable for the next one, if you have no other immediate need for the data you're about to overwrite.
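Extending that idea to the full problem, the country-currency pair and the period can be combined into a single dictionary key, and each dataframe written to a SQLite table of the same name. A minimal sketch, where fetch_pair_data is a hypothetical stand-in for the asker's real data-fetching function:
import sqlite3
import pandas as pd

def fetch_pair_data(pair, period):
    # Hypothetical stand-in for the real function that builds a dataframe
    return pd.DataFrame({"pair": [pair], "period": [period]})

pairs = ['USA_dolar', 'Canada_dolar', 'France_euro']   # from the asker's lists
period_lst = ['1y', '2y', '3y', '4y', '5y']

dfs = {}
for pair in pairs:
    for period in period_lst:
        dfs[f'{pair}_{period}'] = fetch_pair_data(pair, period)

# Write each dataframe to a table named after its key, e.g. USA_dolar_1y
with sqlite3.connect('rates.db') as conn:
    for name, df in dfs.items():
        df.to_sql(name, conn, if_exists='replace', index=False)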

What are the advantages of using column objects instead of strings in PySpark

In PySpark one can use column objects and strings to select columns. Both ways return the same result. Is there any difference? When should I use column objects instead of strings?
For example, I can use a column object:
import pyspark.sql.functions as F
df.select(F.lower(F.col('col_name')))
# or
df.select(F.lower(df['col_name']))
# or
df.select(F.lower(df.col_name))
Or I can use a string instead and get the same result:
df.select(F.lower('col_name'))
Read the PySpark style guide from Palantir (github.com/palantir/pyspark-style-guide), which explains when to use F.col() (and when not to), along with other best practices.
In many situations the first style (df.col_name / df['col_name']) can be simpler, shorter and visually less polluted. However, we have found that it faces a number of limitations that lead us to prefer the second style (F.col('col_name')):
If the dataframe variable name is large, expressions involving it quickly become unwieldy;
If the column name has a space or other unsupported character, the bracket operator must be used instead. This generates inconsistency, and df1['colA'] is just as difficult to write as F.col('colA');
Column expressions involving the dataframe aren't reusable and can't be used for defining abstract functions;
Renaming a dataframe variable can be error-prone, as all column references must be updated in tandem.
Additionally, the dot syntax encourages the use of short and non-descriptive variable names for dataframes, which we have found to be harmful for maintainability. Remember that dataframes are containers for data, and descriptive names are a helpful way to quickly set expectations about what's contained within.
By contrast, F.col('colA') will always reference a column designated colA in the dataframe being operated on, named df, in this case. It does not require keeping track of other dataframes' states at all, so the code becomes more local and less susceptible to "spooky interaction at a distance," which is often challenging to debug.
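The reusability point from the guide is easy to demonstrate: an expression built with F.col() is not tied to any particular dataframe, so it can live in a small helper and be applied to whatever dataframe needs it. A minimal sketch (the lowered helper and the sample data are illustrative, not from the question):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def lowered(col_name):
    # Returns a Column expression usable against any dataframe
    return F.lower(F.col(col_name))

df1 = spark.createDataFrame([("MANGROVE",)], ["wetland"])
df2 = spark.createDataFrame([("MARSH",)], ["wetland"])

# The same expression works on both dataframes
df1.select(lowered("wetland")).show()
df2.select(lowered("wetland")).show()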
It depends on how the functions are implemented in Scala.
In Scala, the signature of a function is part of the function itself (functions can be overloaded). For example, func(foo: String) and func(bar: Int) are two different functions, and Scala can tell which one you are calling from the type of the argument you pass.
F.col('col_name'), df['col_name'] and df.col_name are all the same type of object, a Column. It is almost the same to use one syntax or another. A small difference is that you could write, for example:
df_2.select(F.lower(df.col_name)) # Where the column is from another dataframe
# Spoiler alert : It may raise an error !!
When you call df.select(F.lower('col_name')), if a lower(smth: String) overload is not defined on the Scala side, you will get an error. Some functions are defined with a string as input; others accept only Column objects. Try it to find out whether it works, and use it if it does; otherwise, you can make a pull request on the Spark project to add the new signature.
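A quick, throwaway way to "try it and see", as suggested above; whether the string form works can vary between Spark versions, so treat this as a sketch:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("ABC",)], ["col_name"])

df.select(F.lower(F.col("col_name"))).show()  # always valid: Column in, Column out
df.select(F.lower("col_name")).show()         # valid only if a string overload exists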

when converting XML to SEVERAL dataframes, how to name these dfs in a dynamic way?

my code is on the bottom
"parse_xml" function can transfer a xml file to a df, for example, "df=parse_XML("example.xml", lst_level2_tags)" works
but as I want to save to several dfs so I want to have names like df_ first_level_tag, etc
when I run the bottom code, I get an error:
f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
^
SyntaxError: can't assign to literal
I also tried the .format method instead of an f-string, but it didn't work either.
There are at least 30 dfs to save and I don't want to do it one by one; I have always succeeded with f-strings in Python outside pandas, though.
Is the problem here with the f-string/format method, or does my code have some other logic problem?
If necessary, the parse_XML function definition comes directly from this link.
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    f'df_{first_level_tag}' = parse_XML("example.xml", lst_level2_tags)
This seems like a situation where you'd be best served by putting them into a dictionary:
dfs = {}
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    dfs[first_level_tag] = parse_XML("example.xml", lst_level2_tags)
There's nothing structurally wrong with your f-string, but you generally can't get dynamic variable names in Python without doing ugly things. In general, storing the values in a dictionary ends up being a much cleaner solution when you want something like that.
One advantage of working with them this way is that you can then just iterate over the dictionary later on if you want to do something to each of them. For example, if you wanted to write each of them to disk as a CSV with a name matching the tag, you could do something like:
for key, df in dfs.items():
    df.to_csv(f'{key}.csv')
You can also just refer to them individually (so if there was a tag named a, you could refer to dfs['a'] to access it in your code later).

Create dynamic names for lists [duplicate]

This question already has answers here:
Dictionary use instead of dynamic variable names in Python
(3 answers)
Closed 7 years ago.
From a shapefile I create a number of csv files but I don't know how many of them will be created each time. The csv files are named road_1, road_2 etc.
In these files, I have coordinates. I would like to put the coordinates of every csv files in lists.
So for road_1 I would like 3 lists:
x_1, y_1, z_1
For road_2:
x_2, y_2, z_2 etc.
I tried to name the lists in the loop where I read the coordinates with this: list+'_'+i, where i iterates over the number of files created, but I cannot concatenate a list and a string.
EDIT
OK, some marked this topic as duplicated, fair enough. But just saying that I have to use a dictionary doesn't answer all of my question. I had thought of using a dictionary, but my main issue is creating the name (either the name of the list or the key of the dictionary). I have to create these in a loop, without knowing how many I will have to create. Thus, the name of the list (or key) should contain a variable, which must be the number of the road, and it's here that I have a problem.
As I said before, in my loop I tried to use a variable from the iteration to name my list, but this didn't work, since it's not possible to concatenate a list with a string. I could create an empty dictionary with many empty key:value pairs, but I would still have to go through the key names in the loop to add the values from the csv files to the dict.
Since this has been asked many times I won't write the code, but only point you in the right direction (and maybe suggest a different approach).
Use the glob module, which will return the file names. Something like:
import glob
for csvFileNames in glob.glob("dir/*.csv"):
will put each filename into the variable csvFileNames.
Then you simply open your csv files with something like:
with open(csvFileNames, "r") as filesToRead:
    for row in filesToRead:
        # the structure of your csv file is unknown, so nothing specific can be said here
Then it's simple. Find the columns you're interested in and create a dict with the variables you need as keys. Use a counter to increment the road number. All the information is there!
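Putting those pieces together, here is a minimal sketch of the dictionary approach; the x/y/z column layout of the csv files is an assumption, since the real structure wasn't given:
import csv
import glob
import os

# One entry per road, keyed by the number extracted from the file name
coords = {}

for csv_path in glob.glob("dir/road_*.csv"):
    # "road_1.csv" -> key "1"
    road_number = os.path.splitext(os.path.basename(csv_path))[0].split("_")[1]
    x, y, z = [], [], []
    with open(csv_path, "r", newline="") as f:
        for row in csv.reader(f):
            # Assumed column order: x, y, z
            x.append(float(row[0]))
            y.append(float(row[1]))
            z.append(float(row[2]))
    coords[road_number] = (x, y, z)

# Access the lists for road 1 later:
# x_1, y_1, z_1 = coords["1"]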
