Submitted by opossumchampagne in just_post
Hello and Peace Be Upon the Fempire. I love Just Post, I love Postmill, and I love you. I'm here to share something Neat I Noticed: https://arxiv.org/html/2410.19100v1
Bad news: It's a deep learning research paper, i.e. AI. eww
Neat news: The paper describes a system that ingests YouTube tutorial videos for Reddit, GitHub, etc. and performs actions based on them. Based on the text of the paper and the associated code, they ran their "Reddit" bot against a Postmill instance!
More details: It turns out the "Reddit" they used comes from Web-Arena-X's gyms (see: WebArena and VisualWebArena), which use the Apache and MIT licenses, respectively. Here is the link where Carnegie Mellon hosts their Postmill fork. It's 50GB, and "populated-exposed-withimg" makes me think it might be scraped Reddit data?
Other neat news: I found this Postmill instance, which is a Reddit mirror running on a public EC2 instance. Someone scraped Reddit and rehosted it there. Neat.
I'm not sure whether this violates Postmill's license, and I wouldn't necessarily trust that it's all above board. PhD students generally don't get a lot of oversight over whether they're adhering to open-source licenses, and a lot of them are people in their 20s who have never written software before and have never thought about legal terms.
Anyways, once again, Peace Be Upon The Fempire, I love you