I've been trying to create a table of randomly generated data using Pandas and NumPy. I've looked at the Pandas cheat sheet but still can't get this to work.
import names
import pandas as pd
import random
import numpy as np
random.seed(100)
currency_numbers = random.sample(range(100000, 1000000), 100)
s = pd.Series(np.random.randn(100))
raw_data = {
    "Names":["".join(names.get_full_name()) for i in range(100)],
    "Names2":["".join(names.get_full_name()) for i in range(100)],
    "Currency":[]
}
df = pd.DataFrame(raw_data, columns=["Names", "Names2", "Currency"])
df.head()
How can I create a column of 100 random numbers under the Currency column?
Just use the function: np.random.randint()
For example, np.random.randint(1000, size=100) draws integers from the half-open interval [0, 1000), so the largest possible value is 999, and it returns an array of length 100.
Therefore in your case,
s = np.random.randint(1000,size=100)
then set Currency to s,
"Currency":s
and the resulting DataFrame should give a column with 100 random numbers
FYI, this function also accepts a low and a high bound (the high bound is exclusive).
So in your case it would be something like this:
s = np.random.randint(100000, 1000000,size=100)
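The same idea can be sketched with NumPy's newer Generator interface (np.random.default_rng). This is only an illustration: the placeholder name strings stand in for the names package so the snippet has no extra dependency, and the seed value and the n_rows variable are assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)  # seeded generator; the seed value is arbitrary

n_rows = 100
raw_data = {
    # placeholder strings stand in for names.get_full_name()
    "Names": [f"Person {i}" for i in range(n_rows)],
    "Names2": [f"Contact {i}" for i in range(n_rows)],
    # integers drawn from [100000, 1000000); the upper bound is exclusive
    "Currency": rng.integers(100000, 1000000, size=n_rows),
}
df = pd.DataFrame(raw_data, columns=["Names", "Names2", "Currency"])
print(df.head())
```

Every column has the same length here, which is what the DataFrame constructor requires when it is given a dict of lists or arrays.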
Please check whether this helps.
import names
import pandas as pd
import random
import numpy as np
random.seed(100)
np.random.seed(100)  # seed NumPy's generator too; random.seed() alone does not affect np.random
currency_numbers = np.random.randint(100000, 1000000, size=100)
s = pd.Series(np.random.randn(100))
raw_data = {
    "Names":["".join(names.get_full_name()) for i in range(100)],
    "Names2":["".join(names.get_full_name()) for i in range(100)],
    "Currency":currency_numbers
}
df = pd.DataFrame(raw_data, columns=["Names", "Names2", "Currency"])
df.head()
Related
I have a column whose strings contain percentages, e.g. XX: (+2, 30%); (-5, 20%); (+17, 50%).
I need to extract the highest % value for each such string and perform this on the whole column.
Any advice will be highly appreciated!
Thanks
As I understand it, each cell in column XX is a string containing several percentages. Here is the small test DataFrame I used:
import pandas as pd
import re
df = pd.DataFrame({"XX":["(+2, 30%), (-5, 20%), (+17, 50%)","(+2, 70%), (-5, 20%), (+17, 50%)", ""]})
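pattern = re.compile(r"([0-9.]+)%")
# convert the matches to float before taking the max; comparing the raw strings would sort "9" above "10"
df["XX"].apply(lambda x: max((float(p) for p in pattern.findall(x)), default=-1))
OUTPUT
0    50.0
1    70.0
2    -1.0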
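A vectorized variant is sketched below. It assumes that rows without any percentage should yield NaN rather than -1, and the column name max_pct is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"XX": ["(+2, 30%), (-5, 20%), (+17, 50%)",
                          "(+2, 70%), (-5, 20%), (+17, 50%)",
                          ""]})

# one row per match, indexed by (original row, match number)
matches = df["XX"].str.extractall(r"([0-9.]+)%")[0].astype(float)

# largest percentage per original row; rows without any match become NaN
df["max_pct"] = matches.groupby(level=0).max().reindex(df.index)
print(df["max_pct"])
```

The reindex step matters because extractall drops rows that have no match at all.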
This code returns the largest value in a column of percentages:
import pandas as pd
import numpy as np
data = [['2.3%', 1],['5.3%', 3]]
data = pd.DataFrame(data)
first_column = data.iloc[:, 0]
percent_list = []
for val in first_column:
    percent_list.append(float(val[:-1]))
print(percent_list[np.argmax(percent_list)])
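The same result can probably be reached without an explicit loop. The sketch below assumes every value in the first column ends with a single '%' sign; the names percents and largest are illustrative.

```python
import pandas as pd

data = pd.DataFrame([["2.3%", 1], ["5.3%", 3]])

# strip the trailing '%' and convert the remaining text to floats
percents = pd.to_numeric(data[0].str.rstrip("%"))
largest = percents.max()
print(largest)
```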
I have the following code
import pandas as pd
from pandas_datareader import data as web
import numpy as np
import math
data = web.DataReader('goog', 'yahoo')
data['lifetime'] = data['High'].asfreq('D').rolling(window=999999, min_periods=1).max() # running lifetime high
How can I compare them so that I get a boolean (preferably as 1 and 0) indicating whether data['High'] is close to data['lifetime'] for each row in pandas:
data['isclose'] = math.isclose(data['High'], data['lifetime'], rel_tol = 0.003)
Any help would be appreciated.
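You can use np.where() together with np.isclose(). math.isclose() only accepts scalars, so it raises a TypeError when given a whole Series.
import numpy as np
# rtol is measured relative to the second argument (data['lifetime'])
data['isclose'] = np.where(np.isclose(data['High'], data['lifetime'], rtol=0.003), 1, 0)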
You could also use pandas' apply() function:
import math
from pandas_datareader import data as web
data = web.DataReader("goog", "yahoo")
data["lifetime"] = data["High"].asfreq("D").rolling(window=999999, min_periods=1).max()
data["isclose"] = data.apply(
    lambda row: 1 if math.isclose(row["High"], row["lifetime"], rel_tol=0.003) else 0,
    axis=1,
)
print(data)
However, yudhiesh's solution using np.where() is faster.
See also: Why is np.where faster than pd.apply
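For a check that runs without downloading anything, here is a small sketch on synthetic prices. It swaps the rolling window for cummax() as the running maximum and sets atol=0.0 so that only the relative tolerance applies; both choices are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# synthetic prices standing in for the downloaded 'High' column
data = pd.DataFrame({"High": [100.0, 90.0, 99.9, 105.0, 104.0]})

# running maximum: the highest value seen up to and including each row
data["lifetime"] = data["High"].cummax()

# 1 where High is within 0.3% of the running maximum, else 0
data["isclose"] = np.isclose(
    data["High"], data["lifetime"], rtol=0.003, atol=0.0
).astype(int)
print(data)
```

Unlike math.isclose(), np.isclose() works element-wise on whole columns, so no apply() or explicit loop is needed.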
I have the following dataframe:
df = pd.DataFrame({'A':range(10), 'B':range(10), 'C':range(10), 'D':range(10)})
I would like to shuffle the data using the below function:
import pandas as pd
import numpy as np
def shuffle(df, n=1, axis=0):
    df = df.copy()
    for _ in range(n):
        df.apply(np.random.shuffle, axis=axis)
    return df
However I do not want to shuffle columns A and D, only columns B and C. Is there a way to do this by amending the function? I want to say if column == 'A' or 'D' then don't shuffle.
Thanks
You could shuffle the required columns as below:
import numpy as np
import pandas as pd
# the data
df = pd.DataFrame({'A':range(10), 'B':range(10),
'C':range(10), 'D':range(10)})
# shuffle
df.B = np.random.permutation(df.B)
df.C = np.random.permutation(df.C)
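# or shuffle in place; NumPy warns that shuffling a Series this way is not
# guaranteed to behave correctly, so the assignments above are the safer choice
np.random.shuffle(df.B)
np.random.shuffle(df.C)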
If you need to shuffle using your shuffle function:
def shuffle(df, n=1):
    for _ in range(n):
        # shuffle B (permutation returns a shuffled copy, which is assigned back)
        df["B"] = np.random.permutation(df["B"])
        # shuffle C
        df["C"] = np.random.permutation(df["C"])
        print(df.B, df.C)  # comment this out as needed
    return df
You do not need to disturb columns A and D.
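A more general version could take the columns to shuffle as a parameter. The sketch below is one possible shape, not the asker's function: the name shuffle_columns and the columns and seed parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def shuffle_columns(df, columns, n=1, seed=None):
    """Return a copy of df with only the given columns shuffled n times."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for _ in range(n):
        for col in columns:
            # permutation() returns a shuffled copy, so the assignment is explicit
            out[col] = rng.permutation(out[col].to_numpy())
    return out

df = pd.DataFrame({"A": range(10), "B": range(10), "C": range(10), "D": range(10)})
shuffled = shuffle_columns(df, ["B", "C"], seed=0)
print(shuffled)
```

Each listed column is shuffled independently, so rows of B and C no longer line up with each other, and the original df is left untouched.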
I would like to know if there is an elegant way to sum several pd.DataFrame objects that share exactly the same index and columns using the xarray package.
The problem
import numpy as np
import pandas as pd
import xarray as xr
np.random.seed(123)
pdts = pd.Index(["AAPL", "GOOG", "FB"], name="RIC")
dates = pd.date_range("20200601", "20200620", name="Date")
field_A = pd.DataFrame(np.random.rand(dates.size, pdts.size), index=dates, columns=pdts)
field_B = pd.DataFrame(np.random.rand(dates.size, pdts.size), index=dates, columns=pdts)
field_C = pd.DataFrame(np.random.rand(dates.size, pdts.size), index=dates, columns=pdts)
df_dict = {
"A": field_A,
"B": field_B,
"C": field_C,
}
What I would like to obtain is the res = df_dict["A"] + df_dict["B"] + df_dict["C"] using the Xarray package, which I just started learning. I know there are solutions using Pandas like:
res = pd.DataFrame(np.zeros((dates.size, pdts.size)), index=dates, columns=pdts)
for k, v in df_dict.items():
    res += v
Attempts
What I have tried in xarray:
As the Dataset class looks like a dict of data variables, I thought the most straightforward option would be this:
ds = xr.Dataset(df_dict)
However, ds.sum() does not let me sum across the different data variables. The result is a sum over "Date", over "RIC", or over both, computed separately for each data variable.
Any ideas? Thanks in advance.
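It looks like one way to do it is ds.to_array().sum("variable"). to_array() stacks the data variables along a new dimension named "variable", which can then be summed over.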
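Here is a self-contained sketch of that approach, rebuilding the question's data and converting the result back to a DataFrame. The names total, res, and expected are only for illustration, and the comparison with the plain pandas sum is there as a sanity check.

```python
import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(123)
pdts = pd.Index(["AAPL", "GOOG", "FB"], name="RIC")
dates = pd.date_range("20200601", "20200620", name="Date")
df_dict = {
    key: pd.DataFrame(np.random.rand(dates.size, pdts.size), index=dates, columns=pdts)
    for key in ["A", "B", "C"]
}

ds = xr.Dataset(df_dict)  # one data variable per DataFrame, dims (Date, RIC)

# stack the variables along a new "variable" dimension, then sum it away
total = ds.to_array(dim="variable").sum("variable")

res = total.to_pandas()  # back to a DataFrame indexed by Date with RIC columns
expected = df_dict["A"] + df_dict["B"] + df_dict["C"]
print(np.allclose(res.to_numpy(), expected.to_numpy()))
```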
I am trying to do the equivalent of Excel's COUNTIF() function. I am stuck on how to tell the .count() function to read from a specific column.
I have
df = pd.read_csv('testdata.csv')
df.count('1')
but this does not work, and even if it did it is not specific enough.
I am thinking I may have to use read_csv to read specific columns individually.
Example:
Column name
4
4
3
2
4
1
the function would output that there is one '1' and I could run it again and find out that there are three '4' answers. etc.
I got it to work! Thank you
I used:
print(df.col.value_counts().loc['x'])
Here is an example of a simple 'countif' recipe you could try:
import pandas as pd
def countif(rng, criteria):
    return rng.eq(criteria).sum()
Example use
df = pd.DataFrame({'column1': [4,4,3,2,4,1],
'column2': [1,2,3,4,5,6]})
countif(df['column1'], 1)
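The same pattern extends to other COUNTIF-style criteria, since any boolean condition on a column can be summed. A short sketch, with variable names chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"column1": [4, 4, 3, 2, 4, 1],
                   "column2": [1, 2, 3, 4, 5, 6]})

# COUNTIF(column1, 4): comparing gives True/False, and True sums as 1
count_equal_4 = (df["column1"] == 4).sum()

# COUNTIF(column1, ">2"): any boolean condition works the same way
count_above_2 = (df["column1"] > 2).sum()

print(count_equal_4, count_above_2)
```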
If all else fails, why not try something like this?
import numpy as np
import pandas
import matplotlib.pyplot as plt
df = pandas.DataFrame(data=np.random.randint(0, 100, size=100), columns=["col1"])
counters = {}
for i in range(len(df)):
    if df.iloc[i]["col1"] in counters:
        counters[df.iloc[i]["col1"]] += 1
    else:
        counters[df.iloc[i]["col1"]] = 1
print(counters)
plt.bar(counters.keys(), counters.values())
plt.show()
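The manual loop can likely be replaced by value_counts(), which builds the same value-to-count mapping in one call. A minimal sketch with the plotting step left out; the names counts and counters are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randint(0, 100, size=100), columns=["col1"])

# value_counts() tallies every distinct value in one call
counts = df["col1"].value_counts().sort_index()

counters = counts.to_dict()  # same value -> count mapping as the manual loop
print(counters)
```

If a bar chart is still wanted, counts.plot.bar() is one option, assuming matplotlib is installed.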