Big data systems, scalable machine learning algorithms, health analytic applications, electronic health records.
Syllabus
Syllabus not available.
Textbooks
No textbooks listed.
strong-orbit-1242
January 7, 2026 (Spring 2025)
Great class overall to take near the end of the ML spec. The final project is very well thought out with topics given to us to choose from rather than us cooking up topics.
The final project is done in duos, so it's not that bad, and it can be as hard as you want it to be. I learned a lot building and training a model from scratch for the final project.
This class wasn't as useful as I thought it would be. Coming in I expected to learn a lot about Big Data and how to apply concepts to different projects but it was mostly just PySpark assignments. Exam is short but not awful if you study with the lectures.
Quite easy to get an A, very easy to get a B. Homework code is graded using an autograder, so you can just resubmit until correct.
Definitely a lower workload than previous reviews suggest, but very front-loaded (the final paper was more chill). If you don't come in with strong coding experience, you will struggle on the HW because the course offers no guidance. The lectures only discuss theory, and the HWs are skeleton code you have to figure out. I felt like the HW was just me bashing my keyboard until the autograder passed.
Horribly organized course. Instructions for almost everything were unclear, and TAs were unresponsive. The final felt more like random trivia than an evaluation of our understanding. There's no guidance on how to study, and it included topics from the "optional" labs and several things that were never explicitly covered. The other assignments are graded pretty leniently, so try to get 100s on all the homeworks to have some leeway here.
The assignments were:
Data ETL and prediction in Python
Data ETL and logistic regression in PySpark (including some calculus to derive formulas)
Rule-based and clustering methods for diabetes phenotyping in PySpark
Deep learning for mortality prediction (MLPs, CNNs, and RNNs), including a Kaggle competition among the class.
Final project: replicating a ML paper with a teammate. If you pick a paper with a repo, this is pretty painless (but still time-consuming)
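For the PySpark logistic-regression assignment above, the "calculus" is essentially deriving the SGD gradient update. A minimal NumPy sketch of that update on toy data (not the course's actual starter code or data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, x, y, lr=0.1, reg=0.0):
    """One SGD update for (optionally L2-regularized) logistic regression.

    Per-example gradient: dL/dw = (sigmoid(w.x) - y) * x + reg * w
    """
    grad = (sigmoid(w @ x) - y) * x + reg * w
    return w - lr * grad

# Toy usage: two linearly separable points (features include a bias term).
w = np.zeros(2)
data = [(np.array([1.0, 1.0]), 1), (np.array([-1.0, 1.0]), 0)]
for _ in range(100):
    for x, y in data:
        w = sgd_step(w, x, y)
```

In the real assignment the same update runs over PySpark RDDs, but the math per example is identical.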
TL;DR:
Poorly organized course, but easy to do well if you put in the time (or have lots of ML/DL coding experience). I got an A but feel like I learned almost nothing.
I believe the course content was changed a few semesters ago. The workload now is lower and, I believe, can be managed in combination with another class if you're planning that. Overall, if anybody is looking for the basics of big data tools, healthcare data concepts, and ML, this course serves as a very good introduction from which to dig into more detail. The course videos do well introducing each topic. The TAs are very helpful if you are struggling with concepts. My only suggested change: restoring the course to its old difficulty/time commitment would make it even better.
I'd say the content is honestly not bad; it's good. Even the assignments, exam, and final project are nice, and there's a lot of learning.
But a few factors made the course very disappointing for me. Compared with the Bayes course I took alongside it, the TA and staff involvement was the polar opposite: there have been literally zero discussions, and open questions asked by students are left unanswered for days, sometimes never answered. There are also many errors in the HWs, which the TAs don't care to rectify even though it would be very easy to do (although not all TAs; a couple were really good). It's like they want to spend the least time possible on the course.
Other cons: some of the lectures are nice and cover a huge breadth of topics that are very interesting and relevant to big data and even system design. But I feel the lectures are too shallow to cover such a wide variety of topics, some of which are really complicated concepts. For most of the ones that were new to me, I had to supplement with YouTube videos. I really like the labs, but they are of course outdated and totally forgotten; I really wish they would put more spotlight on them and improve them.
I am really triggered by how badly such a great and important course was managed and run. Like I said, I can assure everyone the quality of the topics covered is great; there's a lot to be learned that could be a great value-add to ML, DL, etc.
My prior experience consists of a bachelor's in CS from Georgia Tech, having taken undergrad AI, ML, and CV and 1 YoE in data engineering. This was my first course in OMSCS.
Overall, having received a high A, I felt that this course was not that difficult; however, the homework at some points felt time-consuming.
First, there are four homework assignments. The homework places a lot of emphasis on joins and filtering data with Python (pandas and PySpark mostly), and the last one is DL-related. Each homework assignment includes coding, calculations, and a report. Some of the calculations in the later HWs felt like busy work and took a while. The homeworks took roughly 10-20 hours each, and you get two weeks to complete each one.
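To give a sense of the join-and-filter work described above, here is a hypothetical pandas sketch; the table and column names are made up for illustration, not from the actual assignments:

```python
import pandas as pd

# Hypothetical tables loosely modeled on EHR-style data.
events = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "event": ["DIAG_A", "LAB_B", "DIAG_A", "LAB_B"],
    "value": [1.0, 2.5, 1.0, 3.0],
})
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "label": [1, 0, 1],
})

# Filter to one event type, join labels on, and count per patient.
diags = events[events["event"] == "DIAG_A"]
merged = diags.merge(patients, on="patient_id", how="inner")
counts = merged.groupby("patient_id").size()
```

The PySpark versions use the same ideas (`filter`, `join`, `groupBy`) with a different API.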
There is a wide variety of topics covered in the short lecture videos. I found the lectures a bit disconnected from the homeworks, as they are mostly high-level information. There are also ungraded hands-on labs exploring topics like Hadoop, Scala, and DL in a provided Docker image. The final consists of multiple-choice questions based on information from the lectures, so simply watching the videos and reviewing before the final should prepare you just fine.
The final project is straightforward and graded leniently. If you want to have an easy time on this, select a paper with a code repo provided.
This course has changed a lot from what I can gather from past warnings and reviews. It's not hard, and not especially time consuming.
Generally, the most challenging part of the assignments was data wrangling, but this seemed to be of secondary importance as far as the lectures went.
Assignments were reasonable; sometimes I got a few points taken off for things I wasn't quite sure were actually wrong, but I did solidly get an A overall.
The group project, as many mentioned, is also graded leniently, and more on process than results, which does seem sensible.
The final is absolute garbage: a mix of trivia questions from the lectures and trivia questions that may once have been in the lectures.
I think this course would do well as a lighter/one semester alternative to ML/DL with some healthcare focus. I think that MPH/Epi students could use something like this, actually.
There are some missed opportunities; I'd love to explore the semantic attributes of medical coding/ontologies, but that's not what this course is really about. Additionally, it'd be nice if the course went more practical into MLE kind of stuff. However, the course does neither of these things now, so if you've taken ML/DL already, I'm not sure what you'd get out of this, especially if you're not in health.
The course syllabus has changed: Scala and Hadoop were removed, and the course is now manageable at under 10 hours per week.
However, I felt like I didn't learn much. There's some simple Spark processing and a really cool Kaggle competition, but most of the content is simply data wrangling with Spark/NumPy/pandas.
The project on healthcare paper reproduction is also interesting. Choose a paper that isn't too difficult and has a GitHub repo with clean code, and you should do fine.
Overall, I felt I learned more about machine learning than big data.
My perspective is that of a CS student in the ML specialization. Prior to this course, I had taken ML, RL, and DL.
The bad:
First, the homework assignments were not something I was a huge fan of. They follow a somewhat similar template to DL: a little easier, but with some tedious aspects. They seem to have changed the assignments this semester, and some directions were ambiguous and not well communicated, resulting in many extra hours of work. Given my prior background, I still got something out of the HW assignments: I got really good at joins and filtering of data.
I think you can get more out of them if you want to. For me, they were a blur of stress and work to slash through.
The good:
The final project was the best part of the class. We replicated a research paper. I felt like I really learned a lot about some specific deep learning algorithms, data processing, natural language processing, and SQL based on the nature of our selected paper. Your results will vary based on the paper you select and the work you put into it. I felt this part of the class alone was worth doing and will have career/resume benefits for me.
The lectures were concise. Some bits weren't great: a complex topic would be presented at a bird's-eye level such that you would not learn it without a lot of your own research or prior exposure. All in all, though, I think the lectures were a good aspect of this class. A lot of them were good reviews of topics from ML/DL, and they would be a decent first exposure if you didn't have that background.
The neutral:
I have not taken the final yet but based on other reviews and the format, it doesn't seem to be a major component of the class. Mainly a reason to watch the lectures at least once.
Grading:
Grading is generous if you do the work. But yes, there is a lot of work.
To succeed in this course, students need to have certain skills and knowledge beforehand. They should be familiar with concepts such as classification and clustering in machine learning and data mining. Proficiency in programming languages like Scala, Python, and Java is also necessary. Knowing how to work with data and understand the ETL process, including skills in SQL and NoSQL like MongoDB, is recommended.
Having these skills is important to do well in the course. Without them, it can be overwhelming, like drinking from a fire hose. The course requires students to go through lectures, understand technology, and implement what they learn on their own. The course covers medical data properties and data mining issues related to healthcare applications such as predictive modeling, computational phenotyping, and patient similarity. Students will also learn about big data analytics technology and its uses, which can also be applied in other sectors.
The course includes five homework assignments (50%), a project (25%), a final exam (20%), and participation (5%).
Homework 1: On the very first day of class, we were tasked with completing the CITI certification to ensure the utmost care and respect when handling sensitive medical data. But that was just the beginning! We delved into the exciting world of descriptive statistics, feature engineering, predictive modeling, and model validation, all leading up to the ultimate challenge: creating the best model. Personally, I had a blast and was able to complete all the work in under 20 hours. However, for those who are not well versed in Python, sklearn, NumPy, and SQL fundamentals, it may be a challenging journey ahead.
Homework 2: Just like homework one, but with a PySpark twist! PySpark is an awesome Python interface for Apache Spark that lets us build Spark applications and analyze data using Python APIs. As I dove deeper into the project, I got to explore some really interesting concepts like RDDs, Spark, and execution plans. And the best part? We didn't just stop at descriptive analysis; we also tackled feature engineering, created an SVMLight dataset, and even implemented SGD logistic regression. Needless to say, this project was a standout experience!
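For context on the SVMLight dataset step mentioned above, here is a tiny hand-rolled sketch of the format; the assignment itself works in PySpark, and this helper is just an illustration in plain Python:

```python
def to_svmlight(label, features):
    """Render one example in SVMLight format: '<label> idx:val idx:val ...'.

    Feature indices in SVMLight files are conventionally 1-based and
    must appear in ascending order, which the sort below guarantees.
    """
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

line = to_svmlight(1, {3: 0.5, 1: 2.0})  # "1 1:2.0 3:0.5"
```

The sparse `idx:val` representation is what makes the format a good fit for the wide, mostly-empty feature matrices you get from EHR data.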
Homework 3: We got to work with Scala, implementing rule-based phenotyping and then diving into unsupervised phenotyping using clustering with K-Means, GMM, and streaming K-Means. I have mixed feelings about Scala; while it's type-safe, debugging can be challenging. With enough practice, though, I believe anyone could become comfortable with it. Personally, I still prefer PySpark. Overall, it was a pretty cool assignment!
Homework 4: We tackled a Scala-based homework that involved some exciting graph modeling. Our task was to represent Electronic Health Record (EHR) data using the GraphX model. To accomplish this, we implemented both the Random Walk with Restart and Power Iteration algorithms.
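The assignment implements Random Walk with Restart in GraphX/Scala; as an illustration of the algorithm itself, here is a NumPy power-iteration sketch on a made-up toy graph (not the assignment's code):

```python
import numpy as np

def random_walk_with_restart(A, seed, alpha=0.15, iters=100):
    """RWR proximity scores to a seed node via power iteration.

    A: adjacency matrix (n x n), seed: restart node index,
    alpha: restart probability.  Iterates
        r <- (1 - alpha) * P @ r + alpha * e_seed
    where P is the column-stochastic transition matrix.
    """
    n = A.shape[0]
    P = A / A.sum(axis=0, keepdims=True)  # normalize each column
    e = np.zeros(n)
    e[seed] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = (1 - alpha) * P @ r + alpha * e
    return r

# Toy triangle graph: the seed node should end up with the highest score.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
scores = random_walk_with_restart(A, seed=0)
```

Since each iteration preserves the total probability mass, `scores` stays a distribution over nodes, and nodes closer to the seed get more of it.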
Homework 5: Out of all the homework assignments, the one that brought me the most joy was working with PyTorch to build various models, including MLPs, CNNs, RNNs, and custom RNNs. It felt like a refresher of the deep learning class I took before, and I managed to complete it in just one week thanks to the techniques I learned there. The highlight of the task for me was calculating the trainable parameters and FLOPs for these complex models, which made the whole experience a lot of fun!
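Counting trainable parameters, as mentioned above, is simple arithmetic for fully connected layers; here is a quick sanity-check helper with hypothetical layer sizes (not the assignment's actual architectures):

```python
def mlp_param_count(layer_sizes):
    """Trainable parameters in a fully connected MLP.

    Each Linear layer mapping n_in -> n_out has n_in * n_out weights
    plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# e.g. a 784 -> 128 -> 10 MLP:
n = mlp_param_count([784, 128, 10])  # 784*128 + 128 + 128*10 + 10 = 101770
```

In PyTorch itself, the same count falls out of `sum(p.numel() for p in model.parameters() if p.requires_grad)`, which is a handy way to check your hand calculation.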
OMSA Computational Data track; more reviews here: https://www.linkedin.com/pulse/georgia-tech-omsa-program-review-sid-gudiduri/