Georgia Tech OMSA (Georgia Institute of Technology Online Master of Science in Analytics)
-
%(#ff8800)[Important Links:]
Unofficial OMSA Slack Directory
Georgia Tech MicroMasters Program: Analytics: Essential Tools and Methods
Georgia Tech MicroMasters course codes
Georgia Tech school calendar 2019-20
Georgia Tech MicroMasters Program lecturers/staff
Georgia Tech academic calendar 2019-2020
Georgia Tech academic calendar 2020-2021
Georgia Tech OMS Analytics start dates 2020-2021
Georgia Tech OMS Analytics computational data analytics track
Georgia Tech course projections
Georgia Tech OMS analytics google docs with course syllabus links
Georgia Tech OMS analytics 2019 Summer and Fall Class list
Links to good foundational maths classes
best link for converting EDT to AEST
spreadsheet with lots of useful information summarising each subject
-
%(#c2ab91)[Proposed subject choices...]
Georgia Institute of Technology MicroMasters Program in Analytics: Essential Tools and Methods
%(#fa8500)[Summer 2020 ] (mid May-mid Aug)
- ISYE 6501x Introduction to Analytics Modeling [introductory core subject]
- MGT 6203x Data Analytics for Business [advanced core subject]
%(#fa8500)[Fall 2020] (mid Aug-mid Dec)
- CSE 6040x Computing for Data Analysis [introductory core/computing subject]
Georgia Institute of Technology Online Master of Science in Analytics.
%(#fa8500)[Spring 2021] (mid Jan-mid May)
- CSE 6242 Data and Visual Analytics [advanced core/computing subject]
%(#fa8500)[Summer 2021] (mid May-mid Aug)
- MGT 6754/8803 Introduction to Business for Analytics [introductory core subject]
%(#fa8500)[Fall 2021] (mid Aug-mid Dec)
- ISyE 6414 Regression Analysis [statistics subject]
or - ISYE 7406 Data Mining and Statistical Learning [statistics subject]
or - ISyE 8803 High-Dimensional Data Analytics [statistics subject]
%(#fa8500)[Spring 2022] (mid Jan-mid May)
- CSE/ISyE 6740 Computational Data Analytics (Machine Learning) [statistics/computing subject]
%(#fa8500)[Summer 2022] (mid May-mid Aug)
- ISyE 6644 Simulation [operations subject]
%(#fa8500)[Fall 2022] (mid Aug-mid Dec)
- CS 7643 Deep Learning [elective computing subject]
or - ISYE 7406 Data Mining and Statistical Learning [elective statistics subject]
or - ISyE 8803 High-Dimensional Data Analytics [elective statistics subject]
%(#fa8500)[Spring 2023] (mid Jan-mid May)
- CS 6400 Database Systems Concepts and Design [elective computing subject]
or - CSE 6250 Big Data in Healthcare [elective computing subject]
or - CS 7642 Reinforcement Learning [elective computing subject]
%(#fa8500)[Summer 2023] (mid May-mid Aug)
- CSE/ISyE/MGT 6748 Applied Analytics Practicum
-
%(#ffffff)[ISYE6501:]
Homework Tips:
- If you cannot correctly download/open someone's docx file for grading from edX (e.g. it shows up as a zip file or something unintelligible), rename the file so that it has a .docx extension at the end. It should now be able to be opened by MS Word.
- If you wish to upload an RStudio markdown file to edX, rename the '.rmd' extension to '.txt' and advise the person grading the assignment to rename it to '.rmd' after downloading.
- If you find that chunks of code or text are somehow randomly excluded from your pdf files after converting from markdown format to pdf, try splitting up code or text blocks into smaller blocks and re-attempt the conversion.
Essence of calculus
Good introductory video for R
Good notes for SVM
Good video to understand SVM
R package to display formulas from models
summary of difference between knn and k-means
Good video for explaining k-means
Visualising PCA
ISYE6501 lecture transcripts
advice for studying for ISYE6501
ISYE6501 Slack Channel
Good video for eigenvalues and eigenvectors
time series and forecasting
useful links for learning Arena
good comparison of models part 1
good comparison of models part 2
optimising shelf space for linear programming
Dedupe for combining data sets and entity resolution
linear algebra course
ISYE6501 quizlet
python for data science cheat sheet
machine learning project checklist
plagiarism detection system used by Georgia Tech
visualising PCA
exponential smoothing
Also of interest:
how to deal with imbalanced data sets 1
how to deal with imbalanced data sets 2
-
In May this year I began the 'Analytics: Essential Tools and Methods' MicroMasters developed by the Georgia Institute of Technology (GTx) and delivered through the massive open online course (MOOC) provider edX.
Subjects offered within the MicroMasters are exactly the same as those provided in the highly respected 'Online Master of Science in Analytics' (OMSA) offered by Georgia Tech and can be directly credited towards the full master's program. One caveat to consider, however, is that the credited subjects cannot contribute towards one's GPA in the OMSA.
I'm happy to say that I have received my certificate for completing the first subject in the MicroMasters program: "Introduction to Analytics Modeling". This subject was quite challenging, particularly in the context of a condensed US university summer semester, and I can honestly say that I learned a lot. A highlight for me was learning how to use SimPy, a Python-based discrete-event simulation framework, for process optimisation. Subsequently, as part of my final project, I became aware of Syngenta's use of a range of analytical tools (including discrete-event simulation) to gain estimated savings of $287 million across their seed product development pipeline between 2012 and 2016. That's quite an impressive statistic. It was also great to get a look under the hood of a range of machine learning methods for data analysis, and my R coding skills definitely got a boost.
I was additionally quite excited to use the edX platform for the first time. edX is the only major MOOC platform that is non-profit and open-source. You can view a recent interview with Anant Agarwal, the founder of edX, here. He really does have a vision for making education affordable worldwide.
I would like to say that my experience was 100% positive; unfortunately, however, there were technical issues along the way. Sometimes I couldn't upload/download certain file types for the assignments. Sometimes I would spend hours grading someone's assignment only to find I was not allowed to submit the grade. Worst of all, the exam monitoring software crashed in the middle of my final exam (worth 25% of the final grade) and I subsequently received a 0 for it. [After some negotiation I was eventually able to receive an imputed score for the exam and have it displayed in edX.]
Despite these difficulties, I'm continuing onwards with the remainder of the MicroMasters. Who knows, I might even get some time to fire up a Docker container with the edX platform myself and have a look at fixing some bugs. One thing I'd particularly like to see addressed is the platform's overall ability to cope with unstable internet connections. After all, unstable internet connections remain commonplace throughout the world (and are inherent to wireless/mobile networks).
As a final note, I'm left wondering why no Australian universities offer cost-effective online masters courses for Analytics/Data Science like the Georgia Institute of Technology does. The University of Adelaide has a ~1.4 year Graduate Diploma in Data Science for ~31K AUD. The University of New South Wales has a ~2 year Master of Data Science for ~50K AUD. RMIT has a ~2 year Master of Data Science Strategy and Leadership for ~42K AUD. Finally, James Cook University has a 2.7 year Master of Data Science for ~53K AUD. None of these options compare well to the ~14K AUD (9.9K USD) cost of the ~3 year OMSA or Online Master of Science in Computer Science (OMSCS) offered by Georgia Tech.
-
%(#ffffff)[CSE6040:]
%(#ff7300)[Glossary of mathematical symbols]
Python regular expressions (regex)
Fall 2020 CSE6040 slack channel
CSE6040 Spring 2021 timetable
The Elements of Statistical Learning
Download all files in a path on Jupyter notebook server
Python Pandas Tutorial: A Complete Introduction for Beginners
Algorithms, Data Structures and Big O Notation
%(#000000)[Hint for Problem 5 of the "More Python Exercises on Vocareum (External resource)": You can use the list.count(element) method to determine how many times element appears in list. Use a sanity check for when the list contains fewer than 2 elements.]
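As a purely hypothetical illustration of the hint above (not the actual Problem 5 solution), a count-based check with a guard for short lists might look like this:
def has_repeats(items):
    # Sanity check: a list with fewer than 2 elements cannot contain repeats
    if len(items) < 2:
        return False
    # list.count(element) returns how many times element appears in the list
    return any(items.count(x) > 1 for x in items)

print(has_repeats([1, 2, 2, 3]))  # True
print(has_repeats([5]))           # False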
Comparing boolean and int using isinstance
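For context on the linked isinstance discussion, a quick check of my own (not from the course): in Python, bool is a subclass of int, so the order of type checks matters.
print(isinstance(True, int))   # True, because bool is a subclass of int
print(isinstance(1, bool))     # False
# To treat booleans separately from plain ints, test bool first:
def is_plain_int(x):
    return isinstance(x, int) and not isinstance(x, bool)
print(is_plain_int(True), is_plain_int(3))  # False True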
5 Must-Know Applications of Singular Value Decomposition (SVD) in Data Science
Python:
https://ehmatthes.github.io/pcc/cheatsheets/README.html
https://learnxinyminutes.com/docs/python/
https://sites.engineering.ucsb.edu/~shell/che210d/python.pdf
http://www.eas.uccs.edu/~mwickert/Python_Basics.pdf
Pandas:
https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
Numpy:
https://cs231n.github.io/python-numpy-tutorial/
https://sites.engineering.ucsb.edu/~shell/che210d/numpy.pdf
Regex:
https://www.regular-expressions.info/characters.html
https://regexone.com/lesson/introduction_abcs
https://www.debuggex.com/cheatsheet/regex/python
https://www.pythonsheets.com/notes/python-rexp.html
Data Structures, List Comprehension
https://spapas.github.io/2016/04/27/python-nested-list-comprehensions/
https://livebook.manning.com/book/python-workout/chapter-4/v-3/92
https://docs.python.org/3/tutorial/datastructures.html
https://openbookproject.net/thinkcs/python/english3e/dictionaries.html
https://www.cs.cornell.edu/courses/cs1110/2018sp/lectures/lecture14/presentation-14.pdf
https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/
%(#ffffff)[Notes for Notebook 2:]
∀ is read "for all"
⇒ is read as "implies"
∈ Denotes set membership, and is read "in" or "belongs to". That is, x ∈ S means that x is an element of the set S
A=set of possible pairs
⊥ denotes the logical predicate 'always false'. As a binary operator, it denotes perpendicularity and orthogonality.
%(#ffffff)[How to download notebooks from Vocareum:]
Method 1:
- Before opening the notebook, check the box next to the notebook name (i.e. 'part0.ipynb') and a download button pops up toward the top of the screen.
Method 2 (this works!):
- To download all the contents of the notebook (including starter code and the notebook version after you've worked on it), select any notebook and enter this command in a cell:
!tar --exclude='*/.voc' --exclude='*/.voc.work' -chvzf notebook.tar.gz ../../*
- This will create a notebook.tar.gz file in the same folder as the notebook in which you entered the command. For example, for Notebook 1 I ran this command in 0-basics.ipynb and it created the .tar.gz file, which you can then open with a program such as 7-Zip on Windows:
The structure of the unzipped folder will look like this snapshot: a folder for each part (part0, 1 and 2). Inside each folder you can see the starter code in the 'resource' folder and the version you modified in the 'work' folder.
Final note: the "../../" part of the command goes two levels up, which should be enough to capture the different parts of a given notebook. I believe that if you use "../../../" you will download all the notebooks (available so far) in one shot.
Method 3 (this works, although not for all notebooks):
- Navigate to the folder containing the notebook in Vocareum (often using the menu button in the top left corner)
- In the Actions menu (top right corner) choose 'Download starter code' and save it to a folder on your PC. Unzip the folder on your PC and navigate to the 'startercode.0' folder. The ipynb file will be in there. Install Jupyter on your PC so you can open it.
- If there are any data files associated with the notebook you will need to download them too. In the 'startercode.0' folder on your PC, create a 'resource' folder, then an 'asnlib' folder inside, and then a 'public data' folder inside that. Navigate to the folder containing the notebook in Vocareum. Click on the 'resource' folder, then the 'asnlib' folder and then 'public data'. Download all the contents into the 'public data' folder on your PC.
%(#ffffff)[Notes for Notebook 3:]
≡ Denotes an identity, that is, an equality that is true whichever values are given to the variables occurring in it. In number theory, and more specifically in modular arithmetic, it denotes congruence modulo an integer.
%(#ffffff)[Notes for Notebook 4:]
≪ , ≫ Mean "much less than" and "much greater than".
ϵ , machine epsilon
∥x∥ is the norm of x (i.e. the length of the vector x)
online decimal to binary converter
Kahan summation algorithm
Accurate algorithms for computing sums
Floating Point Arithmetic: Issues and Limitations <- very useful
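Since the links above cover the Kahan summation algorithm and floating point pitfalls, here is a minimal sketch of compensated summation (my own example with made-up values, not from the notebook): the naive sum loses the tiny addends, while the compensated sum keeps track of them.
def kahan_sum(values):
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y  # recovers the part of y that was rounded away
        total = t
    return total

vals = [1.0] + [1e-16] * 1000000
print(sum(vals))        # 1.0 -- each tiny addend is rounded away
print(kahan_sum(vals))  # ~1.0000000001 -- close to the true sum 1 + 1e-10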
%(#ffffff)[Notes for Notebook 5:]
good online regex tester/editor
%(#ffffff)[Notes for Notebook 7:]
- Inner-join: Keep only rows of A and B where the on-keys match in both.
- Outer-join: Keep all rows of both frames, but merge rows when the on-keys match. For non-matches, fill in missing values with not-a-number (NaN) values.
- Left-join: Keep all rows of A. Only merge rows of B whose on-keys match A.
- Right-join: Keep all rows of B. Only merge rows of A whose on-keys match B.
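A minimal pandas sketch of the four join types described above, using two small made-up frames; merge's how= argument selects the join type and non-matching rows are filled with NaN:
import pandas as pd

A = pd.DataFrame({"key": ["a", "b", "c"], "valA": [1, 2, 3]})
B = pd.DataFrame({"key": ["b", "c", "d"], "valB": [20, 30, 40]})

for how in ["inner", "outer", "left", "right"]:
    print(how)
    print(A.merge(B, on="key", how=how), "\n")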
good reference for sorting pandas data frames
How to define dataframes in pandas:
import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
#################
# Alternative method:
#################
df = pd.DataFrame(columns=['C1', 'C2', 'C3'],
                  data=list(zip(['A', 'B', 'C'], [1, 3, 5], [2, 4, 6])))
How to slice and select data from pandas data frames in Python:
link
In [68]: df1
Out[68]:
           0         2         4         6
0   0.149748 -0.732339  0.687738  0.176444
2   0.403310 -0.154951  0.301624 -2.179861
4  -1.369849 -0.954208  1.462696 -1.743161
6  -0.826591 -0.345352  1.314232  0.690579
8   0.995761  2.396780  0.014871  3.357427
10 -0.317441 -1.236269  0.896171 -0.487602
#########################
Select via integer slicing:
#########################
In [69]: df1.iloc[:3]
Out[69]:
          0         2         4         6
0  0.149748 -0.732339  0.687738  0.176444
2  0.403310 -0.154951  0.301624 -2.179861
4 -1.369849 -0.954208  1.462696 -1.743161
In [70]: df1.iloc[1:5, 2:4]
Out[70]:
          4         6
2  0.301624 -2.179861
4  1.462696 -1.743161
6  1.314232  0.690579
8  0.014871  3.357427
#########################
Select via integer list:
#########################
In [71]: df1.iloc[[1, 3, 5], [1, 3]]
Out[71]:
           2         6
2  -0.154951 -2.179861
6  -0.345352  0.690579
10 -1.236269 -0.487602
In [72]: df1.iloc[1:3, :]
Out[72]:
          0         2         4         6
2  0.403310 -0.154951  0.301624 -2.179861
4 -1.369849 -0.954208  1.462696 -1.743161
In [73]: df1.iloc[:, 1:3]
Out[73]:
           2         4
0  -0.732339  0.687738
2  -0.154951  0.301624
4  -0.954208  1.462696
6  -0.345352  1.314232
8   2.396780  0.014871
10 -1.236269  0.896171
%(#ffffff)[Notes for notebook 10:]
-
Good link for understanding CSR representation of sparse matrices
-
Another good link for understanding CSR representation of sparse matrices
My explanation of CSR:
cols = [1, 2, 4, 0, 2, 3, 0, 1, 3, 4, 1, 2, 5, 6, 0, 2, 5, 3, 4, 6, 3, 5]
values = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
rowptr = [0, 3, 6, 10, 14, 17, 20, 22]
- You use the values in rowptr as indices to slice cols and values up into rows that contain non-zero values. E.g. rowptr values 0 and 3 are used to slice cols into [1, 2, 4] and values into [1, 1, 1]; rowptr values 3 and 6 are used to slice cols into [0, 2, 3]; and so on (see the scipy sketch below).
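A small scipy sketch of the same CSR triple, assuming the 7x7 matrix implied by the indices above; scipy's csr_matrix constructor accepts exactly this (values, cols, rowptr) form:
import numpy as np
from scipy.sparse import csr_matrix

values = np.ones(22, dtype=int)
cols = np.array([1, 2, 4, 0, 2, 3, 0, 1, 3, 4, 1, 2, 5, 6, 0, 2, 5, 3, 4, 6, 3, 5])
rowptr = np.array([0, 3, 6, 10, 14, 17, 20, 22])

A = csr_matrix((values, cols, rowptr), shape=(7, 7))
print(A.toarray())

# Row i's non-zero column indices are cols[rowptr[i]:rowptr[i+1]]
print(cols[rowptr[0]:rowptr[1]])  # [1 2 4], the first slice described above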
%(#ffffff)[Notes for notebook 11:]
How to connect to a database with Python sqlite3 and do a query:
import pandas as pd

conn = load_db('dbs/Database1.db')  # load_db is a helper (e.g. defined in the notebook) that returns a sqlite3 connection
queryz = """SELECT * FROM Database1 LIMIT 5"""
### METHOD WITHOUT pandas ###
cur = conn.cursor()
cur.execute(queryz)
all_results = cur.fetchall()
print(all_results)
### METHOD WITH pandas ###
df = pd.read_sql_query(queryz, conn)
print(df)
INNER JOIN(A, B): Keep rows of A and B only where A and B match
OUTER JOIN(A, B): Keep all rows of A and B, but merge matching rows and fill in missing values with some default (NaN in Pandas, NULL in SQL)
LEFT JOIN(A, B): Keep all rows of A but only merge matches from B.
RIGHT JOIN(A, B): Keep all rows of B but only merge matches from A.
How to left join a table twice to another table with Python sqlite3 and rename the columns:
### Table1 ###
#    ID  A  B
# 0   0  0  1
# 1   1  0  6
### Table2 ###
#    ID    X    Y
# 0   0  167  556
# 1   1  153  766
# 1   6  088  942
query = '''
SELECT Table1.ID AS IDZ,
       Table1.A AS AZ, Table2a.X AS AX, Table2a.Y AS AY,
       Table1.B AS BZ, Table2b.X AS BX, Table2b.Y AS BY
FROM Table1
LEFT JOIN Table2 AS Table2a ON Table1.A = Table2a.ID
LEFT JOIN Table2 AS Table2b ON Table1.B = Table2b.ID
'''
### Results ###
#    IDZ  AZ   AX   AY  BZ   BX   BY
# 0    0   0  167  556   1  153  766
# 1    1   0  167  556   6  088  942
%(#ffffff)[Notes for notebook 12:]
A very good explanation of derivatives
Calculus Made Easy
%(#ffffff)[Notes for notebook 13:]
%(#ffffff)[Notes for notebook 15:]
good video explanation of PCA
excellent video explanation of singular value decomposition
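To go with the linked videos, a quick numpy check (my own toy matrix, not from the course) of what the singular value decomposition returns and that it reconstructs the original matrix:
import numpy as np

X = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(s)                                    # singular values, largest first
print(np.allclose(U @ np.diag(s) @ Vt, X))  # True: U * diag(s) * Vt reconstructs X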
-
%(#ffffff)[MGT6203:]
Resources on how to learn R:
- R for Datascience: http://r4ds.had.co.nz/
- RStudio Education: https://education.rstudio.com/
- Swirl: SwirlStats.com
- DataCamp: DataCamp.com/courses/free-introduction-to-r
Difference between one hot encoding and dummy variables:
- One hot encoding:
-- Red: 1,0,0
-- Blue: 0,1,0
-- Green: 0,0,1
- Dummy:
-- Red: 1,0
-- Blue: 0,1
-- Green: 0,0
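The course itself uses R, but as a quick illustration of the distinction above, pandas can produce both encodings (note that which level gets dropped depends on the category ordering, so it may not be Green as in the example):
import pandas as pd

colours = pd.Series(["Red", "Blue", "Green", "Red"])
print(pd.get_dummies(colours))                   # one-hot: one column per level
print(pd.get_dummies(colours, drop_first=True))  # dummy coding: one level dropped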
How to interpret a regression model when some variables are log transformed
Different types of means (where a and b are two values and n is the number of values):
- Arithmetic mean (AM): (a + b)/n, i.e. the sum of the values divided by the number of values
-- Use when the data is in an additive relationship.
- Geometric mean (GM): the nth root of the product of the n values; for two values, sqrt(ab)
-- Use when the data is in a multiplicative relationship (or compounded, e.g. with compounding interest).
- Harmonic mean (HM): n/(1/a + 1/b)
-- Use for rates.
- HM ≤ GM ≤ AM (with equality only when all values are equal)
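A quick numeric check of the three means and the ordering above (made-up values; geometric_mean requires Python 3.8+):
from statistics import mean, geometric_mean, harmonic_mean

a, b = 40, 60
print(mean([a, b]))            # arithmetic mean = 50
print(geometric_mean([a, b]))  # ~48.99
print(harmonic_mean([a, b]))   # ~48
# harmonic <= geometric <= arithmetic, as noted above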
======================================================
Dependent/independent log-transformed models:
lin-log:
- A P% change in the untransformed independent variable x changes log(x) by approximately P/100 units (since log(1 + P/100) ≈ P/100 for small P).
- The model is: y = b0 + b1*log(x)
- Therefore, a P% change in the untransformed independent variable x changes the untransformed dependent variable y by approximately ((P/100)*b1) units.
log-lin:
- A 1 unit change in the untransformed independent variable x changes the log-transformed dependent variable log(y) by b1 units.
- A change of b1 in log(y) corresponds to approximately a (100*b1)% change in y.
- The model is: y = e^(b0 + b1*x), i.e. log(y) = b0 + b1*x
- Therefore, a 1 unit change in the untransformed independent variable x changes the untransformed dependent variable y by approximately (100*b1)%.
- NB. This approximation only works when b1 is small. The exact percentage change in y is ((exp(b1))-1)*100 % for a one unit change in x.
log-log:
- A 1 unit change in the log-transformed independent variable log(x) changes the log-transformed dependent variable log(y) by b1 units.
- Equivalently, a 1% change in x is associated with approximately a b1% change in y (b1 is an elasticity).
- The model is: log(y) = b0 + b1*log(x)
- Therefore, a 100% change in the untransformed independent variable x changes the untransformed dependent variable y by approximately (100*b1)%.
NB. The log-to-% relationship is only accurate for changes up to around 20%.
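A small simulated check of the log-log case above (made-up coefficients and data, fitted with numpy rather than R): the fitted slope on the log-log scale recovers b1, which reads as "a 1% change in x is associated with roughly a b1% change in y".
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=5000)
b0, b1 = 2.0, 0.5                  # made-up 'true' coefficients
y = np.exp(b0 + b1 * np.log(x) + rng.normal(0, 0.01, size=5000))

# Fit log(y) = b0 + b1*log(x); np.polyfit returns [slope, intercept]
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print(slope)  # ~0.5, i.e. a 1% increase in x ~ a 0.5% increase in y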
================================================
Odds against:
- ratio of failure:success (where failure is the number at the beginning)
- If the odds are 10:1 against and you bet $1 and win you receive $10 + $1
Odds for (also called odds on):
- ratio of success:failure (where success is the number at the beginning)
- If the odds are 10:1 on and you bet $10 and win, you receive your $10 stake back plus $1 in winnings
- odds for= p/(1-p)
- eg. 3:2=p/(1-p)=(3/(3+2))/(2/(3+2))=0.6/0.4=3/2
To convert odds for to odds against, just reverse them (e.g. 2:1 becomes 1:2)
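A tiny sketch of the odds-to-probability conversion above, using the 3:2 example:
def odds_to_prob(success, failure):
    # 'odds for' of success:failure -> probability of success
    return success / (success + failure)

def prob_to_odds(p):
    # probability -> 'odds for' = p/(1-p)
    return p / (1 - p)

p = odds_to_prob(3, 2)
print(p)                # 0.6
print(prob_to_odds(p))  # ~1.5, i.e. 3:2 'odds for'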
Logistic regression model:
- p=exp(a+bx)/(1+exp(a+bx))
- %(#ffffff)[odds for] = exp(a+bx) = p/(1-p) = P(will happen)/P(won't happen)
- 1-p=1-(exp(a+bx)/(1+exp(a+bx)))=1/(1+exp(a+bx))
- log(p/(1-p)) is the output of the regression model. It is the %(#ffffff)[log odds]: logit(p) = log(p/(1-p)) = a+bx. Take exp() of the log odds to recover the odds, p/(1-p).
- p =(odds for)/(1+odds for) = p/(1-p) / (1 + (p/(1-p))) = probability of y=1 = %(#ffffff)[probability of success]
- Probability of failure is 1-p.
- A 1 unit change in the untransformed independent variable x changes the 'odds for' (= p/(1-p), i.e. probability of success/probability of failure) by approximately (b*100)%, holding all other variables fixed. More precisely, the odds change by (((exp(b))-1)*100)%.
- A 1 unit change in the untransformed independent variable x changes the log odds (log(p/(1-p))) by b units, holding all other variables fixed.
- p=p(x)
- p ranges between 0 and 1
- if p >0.5 then y=1 otherwise y=0
- intercept = log odds when all other terms=0
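A quick numeric check of the interpretation above, with made-up coefficients a and b: a one-unit increase in x multiplies the 'odds for' by exp(b).
import math

a, b = -1.0, 0.7   # made-up logistic regression coefficients

def prob(x):
    z = a + b * x
    return math.exp(z) / (1 + math.exp(z))   # p = exp(a+bx) / (1 + exp(a+bx))

def odds(x):
    p = prob(x)
    return p / (1 - p)                       # 'odds for' = exp(a+bx)

print(odds(2.0) / odds(1.0))  # ratio of odds after a one-unit increase in x
print(math.exp(b))            # ~2.0138 -- essentially the same value, as the notes state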
Good explanation of the difference between probability and odds
A good introduction to Linear Regression Models with Logarithmic Transformations
a good explanation of Bayes' Theorem
Unbiased means that "on average" the estimate will be correct.
Covariance value has no upper or lower limit and is sensitive to the scale of the variables.
Correlation value is always between -1 and 1 and is insensitive to the scale of the variables.
Operating performance: measures results relative to the assets used to achieve those results. The focus is on how well assets are converted into earnings, and how efficiently resources are used to generate revenue.
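A quick numpy check of the covariance/correlation point above (made-up data): rescaling a variable changes the covariance but leaves the correlation untouched.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)

print(np.cov(x, y)[0, 1])              # covariance, scale-dependent
print(np.cov(1000 * x, y)[0, 1])       # ~1000x larger after rescaling x
print(np.corrcoef(x, y)[0, 1])         # correlation, between -1 and 1
print(np.corrcoef(1000 * x, y)[0, 1])  # unchanged by rescaling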
How to install Data Explorer:
if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer")
-
|- Brief MGT6203 Review -|
MGT6203 "Data Analytics for Business" is an easy A if your work/family/other commitments permit it.
The subject was quite useful for consolidating regression modelling, forecasting and queueing theory. In addition, learning the fundamentals of factor investing and measuring risk-adjusted performance was a particular highlight for me. I also learned a thing or two about digital marketing, and it was great that Professor Bien regularly chatted with his students.