Comments:
"RDD is basically an array distributed across the cluster" - genius
Computerphile will be excited to learn that tripods exist.
What a useless video: slow down, explain slowly, assume the audience doesn't know much.
Thanks
She must really know this stuff. Very well explained. You can always tell when someone actually knows the content by how simply they can describe it.
This was very helpful
Great explanations. Of course there are many things going on behind the scenes, but it's a good overview.
The content is nice and well explained.
BUT
the camera work and editing are so bad.
We are not here for a documentary; the shot of the computer over her shoulder is completely useless and distracting. If you want to keep your cuts, use something like picture-in-picture, but please let us focus on the code!!
What is the architectural difference between Spark and MapReduce?
Wow, congrats on the content. You were able to explain it in a concise, yet logical and detailed way. Nice.
It's so clear and easy after the explanation! I will be waiting for more vids about clustering and distributed computing)
I wish she had also talked a little about Spark's ability to deal with data streams.
I really love your videos. I would like to know if it is possible to watch them in French, or at least with subtitles, so that we can follow.
Sorry for the redundancy, just verifying my understanding. Do I understand correctly that (when running this example in a cluster) collect runs the 'reduceByKey' against the results on each node and then reduces to a final result? Say on Node 1 I have a count of the word 'something' = 5, and on Node 2 a count of 'something' = 3; collect then combines those two into a count of 'something' = 8, and so on?
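Roughly, yes, with one correction: the merging is done by `reduceByKey` itself, not by `collect`. Each partition first combines its own counts, the partial counts are then shuffled and merged by key across nodes, and `collect` only brings the finished (word, count) pairs back to the driver. A minimal word-count sketch (the file path and setup are placeholders, not the code from the video):

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("word-count").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // "input.txt" is a placeholder path, not the file used in the video.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))   // split each line into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // combine per partition, then merge across the shuffle

    // collect() just gathers the already-merged (word, count) pairs on the driver.
    counts.collect().foreach { case (word, n) => println(s"$word: $n") }

    spark.stop()
  }
}
```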
Is there any meta-analysis on the usefulness of big-data analysis? How often do jobs get run that either produce no meaningful data or don't produce any statistically significant results?
Would have liked it to be a bit more in-depth and technical; it was too high level.
Please show some drawings or animations of the data going back and forth between the nodes.
Please give time measurements comparing single-node with multi-node execution. What is the overhead?
Thanks, nice vid.
She's mumbling in the beginning... can't really hear her (and I'm an American-born English speaker).
What programming language is she using??
Typo in line 32, using `splitLines` instead of `word`?
It's a bit silly, but I can't understand 100% because English isn't my first language. I hope someone could add English subs to every video on this channel, because I find Computerphile videos easy to understand thanks to the excellent explanations.
Looks like you could build a search engine with that.
These data ones are really good! Keep them coming!
Thank you for teaching an old man new things.
More like this!!!!!!
Woohoo, Rebecca is back!
Was so excited to see this posted :) I'm a Cassandra professional.
More of these, please. More big data.
She's damn good at explaining and easy to listen to. Any plans of having her host other episodes? (Sorry for "her", I don't know her name.)
The RDD API is outmoded as of Spark 2.0 and in almost every use case you should be using the Dataset API. You lose out on a lot of improvements and optimizations using RDDs instead of Datasets.
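For comparison, here is a rough sketch (paths and names are placeholders) of the same kind of word count written against the Dataset API; the point is that Datasets go through Spark's optimizer and encoders, which is where many of those improvements come from.

```scala
import org.apache.spark.sql.SparkSession

object DatasetWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dataset-word-count").master("local[*]").getOrCreate()
    import spark.implicits._   // encoders for Dataset[String], Dataset[(String, Long)], ...

    // "input.txt" is a placeholder path.
    val words = spark.read.textFile("input.txt")   // Dataset[String], one line per element
      .flatMap(_.split("\\s+"))

    // groupByKey/count runs through the Catalyst optimizer, unlike the RDD version.
    val counts = words.groupByKey(identity).count() // Dataset[(String, Long)]

    counts.show()
    spark.stop()
  }
}
```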
A great example of how programming languages are a reasonably efficient mechanism for communicating sections of a program, and how natural language really is not.
Apache Flink next, please.
Ahh... so refreshing after taking a week's break from dev work and staying away from non-dev topics. Lol, I love our field. Like music to my ears.
For anyone interested: although the documentation for Apache Flink is awful and it doesn't support Java versions beyond 8, it at least lets you do setup on each node. Spark does not have any functionality for running one-time setup on each node, which makes it infeasible for many use cases. These distributed processing frameworks are quite opinionated, and if you're not doing word count, or streaming data from one input stream to another with very simple stateless transformations in between, you'll find little in the documentation or functionality. They're not really designed for use cases where you have a parallel program with a fixed-size data source known in advance and want to scale it up as you would by adding more threads; they're more for continuous data processing.
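One common workaround worth mentioning (this is a community pattern, not an official per-node hook, and all names below are made up for illustration): keep the expensive setup inside a lazily initialized singleton, so it runs at most once per executor JVM when the first task on that executor touches it, for example from inside `mapPartitions`.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical expensive resource (a native library handle, a client connection, ...).
// A Scala `object` with a lazy val is initialized at most once per JVM, so on a cluster
// this body runs once per executor, the first time one of its tasks touches `handle`.
object ExpensiveResource {
  lazy val handle: String = {
    println("Initializing once per executor JVM")
    "ready"
  }
}

object PerExecutorSetupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("per-executor-setup").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val data = sc.parallelize(1 to 100, numSlices = 4)

    // mapPartitions lets each task touch the shared resource once per partition
    // instead of once per record.
    val tagged = data.mapPartitions { records =>
      val h = ExpensiveResource.handle
      records.map(r => s"$h:$r")
    }

    println(tagged.count())
    spark.stop()
  }
}
```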
I study bioinformatics, handling .txt files many gigabytes in size, and this could be so handy.
The first time I learned about Apache Spark, I was looking up documentation for another framework named Spark.
Really good summary, thank you!
Totally lost me 3 minutes into this video.
Do a video explaining AES!
Feels like this video is four years too late... :-/
Thank you so much. This was an incredible explanation.
Ohhh, she is using VS Code! I love VS Code :D
Really interesting video! I have done some MapReduce before, but I haven't come across Apache Spark.
Good old Scala.
Where are the extra bits?
Note to the editor: please stop cutting away from the code so quickly. We're trying to follow along in the code based on what she's saying. At that moment, we don't need to cut back to the shot of her face. We can still hear her voice in the voiceover.
Can you do Apache Kafka next? How do they compare?