Reddit up/downvote records have more psychometric value than Facebook and Twitter.

18  2018-04-07 by CelineHagbard

I'm going to make the argument that an individual's up/downvote behavior on reddit tells you more about their psychological makeup than their activity on nearly any other platform. This is the real value of reddit.

Twitter and Facebook are more tied to identity, but also have some degree of pseudo-anonymity. Users like and share and retweet publicly, so it shows the face the user chooses to show. With reddit, the votes are confidential (to other users at least) and the accounts are less tied to identity (at least overtly), so people are much more comfortable expressing their true opinions about posts or comments, because they're less worried about what other people think they think. Consequently, the behavior is more indicative of the person's underlying psychic life.

If I had access to reddit's backend, and took a list of all the posts and comments a user has upvoted, downvoted, or hidden, I would know a lot about that user. For a moderately active reddit user, especially lurkers who vote often, I would have thousands if not tens of thousands of opinions that person either agreed or disagreed with. With machine learning on this data, I would be able to predict all sorts of things about this person: their politics, their religion, their interests, their temperament or disposition, their preferences and views, down to some pretty specific detail.

This would give me a very accurate psychological profile on a user, far more useful than anything Cambridge Analytica came away with, I'd say at least an order of magnitude. And no one's even talking about what's happening with this data.

58 comments

It's easier to judge the more nuanced parts of a person's mindset if you present them with a thread - and like 50 different replies within it and they answer "yes/no|like/hate" for every one - each thread is a 50 question personality quiz.

You are very correct.

Yeah, I know. Reddit is getting people to willingly fill out psychological questionnaires about a very broad range of topics that capture their interest, and people are doing it for free. The questionnaires CA got a hold of were primitive compared to this.

I guess I've always known this, but the realization is just now hitting me how much information and knowledge this actually provides. With a massive corpus of preferential data about each user, and the whole dataset about all the users in the system, the machine learning implications are huge. You can apply filters to cancel out noise or filter out bots from the dataset; recognize common or unique patterns on narrow or broad scales; predict affinity for certain pieces of content for either individuals or sets of individuals; and in combination with entities, you could match this specific psychological data with the user's identity. The possibilities are nearly endless.
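A rough illustration of the "predict affinity for certain pieces of content" idea above. This is a minimal sketch of user-based collaborative filtering, not a claim about what reddit actually runs. The usernames, post IDs, and the +1/-1 vote encoding are all made up for the example.

```python
import math

# Hypothetical vote records: user -> {item_id: +1 (upvote) or -1 (downvote)}.
votes = {
    "alice": {"post1": 1, "post2": -1, "post3": 1},
    "bob":   {"post1": 1, "post2": -1, "post4": 1},
    "carol": {"post1": -1, "post2": 1, "post4": -1},
}

def cosine(u, v):
    """Cosine similarity between two sparse vote vectors stored as dicts."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict_affinity(user, item, votes):
    """Similarity-weighted average of other users' votes on `item`.

    Returns a score in [-1, 1]; 0.0 means there is no evidence either way.
    """
    num = 0.0
    den = 0.0
    for other, their_votes in votes.items():
        if other == user or item not in their_votes:
            continue
        sim = cosine(votes[user], their_votes)
        num += sim * their_votes[item]
        den += abs(sim)
    return num / den if den else 0.0

# alice never voted on post4, but she votes like bob and unlike carol.
score = predict_affinity("alice", "post4", votes)
```

Here alice agrees with bob and disagrees with carol on the posts they share, so the prediction for post4 leans toward bob's upvote. With thousands of votes per user the same idea gets a lot less noisy.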

Reddit is the perfect petri dish for studying how ideas, how memes in the original Dawkinsian sense, actually propagate and mutate across the network. There have been academic studies done on reddit content, but those are just scratching the surface, as all the vote data was anonymized. If the owners and partners of reddit are smart with how they use this data, and can keep the knowledge that they have this data away from the users, they are sitting on an absolute goldmine.

I just wonder who's using it and to what ends.

I just wonder who's using it and to what ends.

Whoever actually owns Advance Publications (they own Conde Nast) - someone whose name is not even on the internet... and probably for use by another company they own, or that someone they know owns, that traffics in sorting users into 3+ categories of "risk".

Like you said, you've always known this.

I doubt it's any one person or entity; it's the entire bureaucracy in a way. Sure, some factions likely have more access to the data than others, but it all gets used for the same control program.

that traffics in sorting users into 3+ categories of "risk".

I think levels of risk is also just scratching the surface. This is multi-dimensional analysis that can know more about you than you or your closest friends know about you. And with unstructured machine learning, they can find patterns they never would have thought to look for and answers to questions they never asked.

The practical applications are far-reaching: microtargeting of individuals for advertising or propaganda purposes, far more accurate and precise than CA; learning what levels and techniques of propaganda individuals and populations are susceptible to (e.g. testing the waters for BlueBeam or similar); gauging and molding popular understanding of nearly any aspect of society.

Is there anything they can't squeeze data from anymore?

I bet the owner is just like their parakeet or something

Advance is massive.

Advance is massive. And the Wikipedia article about it is shitty.

I'm kinda relieved I never got in the habit of voting on stories

I'm sure they still get loads off how long you stay on each page and which key words bait your clicks

You're feeding the AI that Daddy Warbucks will use as our slave master. Truth in jest.

Sometimes though I'll upvote an opinion I don't agree with to ensure my reply gets exposure. Could be complex to delineate intention.

I’ve noticed that Reddit doesn’t like anything cynical, flippant, or negative. On Facebook I get likes and comments like “I’m going to hell for laughing at this.” Here it’s all down votes unless the content is all sunshine and daisies.

Then combine with the data from fb. Now you can tell the difference between what people say and what they think. Scary!

Yep, that adds a whole nother layer to it. If you can correlate the private thoughts with the public face, you can classify and predict the discrepancy.

EGO/ID

Responding to a deleted comment I already wrote a response to:


I don't know, I see cynical, flippant, and negative comments do very well if they agree with that subculture's general positions: for example, cynical or mocking comments about Trump in r/politics or most other subs do well, as do cynical or mocking comments about Clinton or the deep state in subs like here or T_D. Or when a company like United or Comcast does something unpopular and everyone gets upvoted for hating them.

Two minutes hate is very popular here, as long as you hate the right thing.

I never thought about this cause I don't really use the function but yea, totally. Kind of disturbing actually. Sure beats the census though!

I have been saying this for a while. Our voting behavior is killer data. With the right analysis, I'm sure they can predict our behavior to a pretty fine point by now. I try to upvote everything I touch, even things I wouldn't, just to hide my mind from them. But I don't think I have done the best job. And even still, they know everything that catches my eye. That is probably gold all by itself.

If I had access to reddit's backend, and took a list of all the posts and comments a user has upvoted, downvoted, or hidden, I would know a lot about that user. For a moderately active reddit user, especially lurkers who vote often,

Most reddit users never give this a thought. So you've got zero awareness which makes the data very useful.

There's a tool that you can use that does exactly this kind of analysis, but in a limited way. My guess is that there is data somewhere about users' activity that could be analyzed six ways from Sunday.

It's supposed to be for checking out your own account, but I've used it on occasion to check out potential shills. One really useful feature is the word-cloud analysis.

This one shows you which words a user uses most often. It's not quite mind-reading, but it's not that far off either.
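For anyone curious what a word-frequency analysis like that boils down to, here is a minimal sketch. It is not how the tool mentioned here is implemented (that isn't known from this thread); the stop-word list and the sample comments are invented for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real tools use much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "it", "in", "that", "i"}

def top_words(comments, n=5):
    """Return the n most frequent non-stop-words across a user's comments."""
    words = []
    for text in comments:
        # Lowercase and keep only runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                words.append(word)
    return Counter(words).most_common(n)

# Made-up comment history for one user.
sample_comments = [
    "The vote data is the real value of the site.",
    "Vote data says more than comments do.",
    "I think the data is a goldmine.",
]

top = top_words(sample_comments, n=2)
```

A word cloud is just this list drawn with font size proportional to the count.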

Yeah, snoopsnoo is kind of cool, and using the API (and a 3rd party database called pushshift) you can get even more data and run whatever kind of analysis you want on it. But the votes themselves are completely hidden from the public unless you specifically make them public. That data is only known to reddit and anyone they share it with.

This one shows you which words a user uses most often. It's not quite mind-reading, but it's not that far off either.

Knowing the votes is even closer to true mind reading, because you can also measure a user's affinity to certain words, topics, ideas, positions, and interests. Most users vote orders of magnitude more than they comment or post.
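To make "affinity to certain topics" concrete, here is a small sketch that averages a user's votes per topic. It assumes each voted item has already been tagged with topics somehow, which is the hard part and is glossed over here; the tags and votes below are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: each voted item carries topic tags and the user's vote
# (+1 for an upvote, -1 for a downvote).
voted_items = [
    {"topics": ["privacy", "tech"], "vote": 1},
    {"topics": ["privacy"], "vote": 1},
    {"topics": ["sports"], "vote": -1},
    {"topics": ["tech", "sports"], "vote": -1},
]

def topic_affinity(items):
    """Mean vote per topic: +1.0 = always upvoted, -1.0 = always downvoted."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for item in items:
        for topic in item["topics"]:
            totals[topic] += item["vote"]
            counts[topic] += 1
    return {topic: totals[topic] / counts[topic] for topic in totals}

affinity = topic_affinity(voted_items)
```

In this toy data the user comes out strongly pro-"privacy", strongly anti-"sports", and neutral on "tech". The same aggregation works for words, subreddits, or authors instead of topics.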

I'm considering writing a script that can go back through and unvote for every post/comment I've voted on, but I imagine there are limitations on how far back it goes, and reddit might also keep the historical data.

I doubt Reddit knows my friends (both virtual and IRL), knows the content of my messages, and has the processing power to compile profiles (including shadow profiles) from this.

Reddit? No, probably not. Whoever they might be working with, including government and quasi-government agencies? You bet they do. If they're not processing this now, you can bet they will be.

The concept is so annoying to me I just stopped doing it like 2 years ago. I sincerely look down on people who downvote now obsessively, like when you're in a back and forth with someone and they just downvote you the second you reply.

I always wonder if they assume I'm downvoting them too, and since they have one point they think that somebody is upvoting everything they say.

I still vote, especially on my own threads or replies to me, even if I don't necessarily agree I'll give an upvote. I only downvote if someone's actively detracting from the discussion at hand.

since they have one point they think that somebody is upvoting everything they say.

I think most people who've been here for longer than a week or so understand that the first upvote is your own.

lol wait isn't this the study that is supposed to be providing a better psychometric evaluation than facebook?

It's not any study; it's just the data they have, and it's only valuable if they can pull useful knowledge out of the data. Some users are going to have behavior patterns which don't give them much useful information, but many users will show many of their preferences and affinities.

On the whole, the dataset is quite valuable, and even more so if usernames can be tied to other accounts or real identities.

Dude, these people don't know shit. I've been saying this the whole time. All these outlandish claims we hear are sales pitches, you know what I mean? If this data was that valuable they should be able to detect, like, who's a serial killer or will become one.

Like what do they really know? Your political leanings? I think they are all grossly exaggerating their claims. They're salesmen.

Question is why people upvote stuff they don't like.

I generally upvote posts if they add value to the discussion, even if I disagree with them. I think a good amount of redditors do the same.

I agree. I also think that AskReddit is a huge way to gather information. Even though most people say the replies to that sub are mostly lies, but still.

With my premise, 90% of the posts and comments could be lies, and the vote data would still provide useful information, maybe even more useful as the "survey questions" could be better designed to gather the desired information.

Yeah I agree. Even with a certain amount of time spent on a page or looking at an image.

This is the best explanation I've seen of the true ability to monetize, and in my opinion weaponize, reddit. People are rage quitting facebook because they believe their privacy was violated by a few ads and finding their political affiliation in their user profile. Meanwhile, reddit has as good, maybe even better, a profile of you than buyer rewards and credit card data. You don't buy everything you want at a store where you have a buyer rewards card, say, but you can upvote or downvote for free.

The monetary value aside, I am certain somewhere the CIA, FBI, NSA, some three letter agency has done this already. I used to think the three letters crawled the site with a keyword algorithm, which they do, but this upvote downvote data is far easier to analyze and paints a better picture. Some people are well versed in English, others not so, but upvotes and downvotes are a simple binary 0 or 1.

Great post. It gives me something to think about.

The monetary value aside, I am certain somewhere the CIA, FBI, NSA, some three letter agency has done this already.

If one of these agencies has access to the actual vote data, it would have to be because either reddit gave it to them (willingly or coerced), reddit fucked up seriously in their security practices, or HTTPS is compromised (which is possible).

Yeah that's true, but there's also people like me who basically forget about up voting and down voting 90% of the time and only remember in short spurts once a month. That could skew the data and maybe not make it useless, but at least less reliable.

Joke's on you, I actually follow reddiquette

In addition to that it is easy to shoot down truth into downvotes and prop up your own agenda into upvotes. The mind uses the law of perceived authority to determine what short cut conclusions to believe on subjects you are not expert enough on or going to give time to learning.

Also easy to target key individuals with too much sway for character assassination, like MLK was.

The only reason we are still using Reddit is because of addiction and lack of safe alternatives. There will be privacy-centric, blockchain-based internet aggregators... Also there will be pay-to-play micropayment ones that will limit spam and shilling, or at least allow you to get paid for their shilling and make it cost prohibitive.

The only reason we are still using Reddit is because of addiction and lack of safe alternatives.

The same goes for FB. I think it's less that we lack real alternatives, or are addicted to this platform per se, and more that so many people use this platform. If you want to communicate with people that use reddit, you pretty much have to use reddit at this point. Alternatives exist, but they just don't have the same reach.

I care more for quality than quantity

As do I, which is why I find smaller subs to be better, at least as far as signal-to-noise ratio goes. I would say that as a whole, reddit has some absolutely excellent content in pretty much any area of interest you may have, but the real problem is finding it amongst all the lower quality content (however you define that).

Other sites may have a higher signal-to-noise ratio, but less quality content in total simply due to the smaller user base to pull from.

OP YES!

You don't think Reddit sells that data for marketing research purposes? It's been going on for years. That's why you need to always use a VPN, always scrub your comments (over-write plus delete), and switch your username.

If you used a real email to verify your account, that's just idiotic...It's NOT JUST Marketing companies that have this data...Think intelligence services.

always scrub your comments (over-write plus delete)

I'd have to see how their backend is written, but this likely isn't enough, and if reddit isn't recording original comments, you can be sure 3rd parties are. As far as VPN and username switching, it will help against internet detectives, but one slip up from perfect infosec and a sufficiently determined adversary can connect you to all your accounts.

I'm not particularly worried about this as far as it goes; I just use reddit with the knowledge that everything I'm typing can and likely will be tied to me. Reddit is not and should not be thought of as an anonymous platform, which is more what I feel the takeaway should be. If you want real anonymity, onions and zeronode with appropriate precautions are the best bet, but even with that it's more security by obscurity than true security.

No, no, no...I mean, you keep an account for a few months, then you scrub it and move on to a new account that has no associations with the old account and never go back to the old one. That makes linking accounts together MUCH harder. But it's not the "internet detective" (i.e. someone trying to dox you for whatever reason) you have to worry about, it's Reddit providing the info willingly to whatever agency is asking for it (or even selling it to them).

Agree, it is likely that all your original comments are being logged somewhere, however, at one time, Reddit was open source and it was known that they only recorded the "last edit" on every comment. Hence you over-write the comment with nonsense and then delete. Who knows what is happening now?

No, no, no...I mean, you keep an account for a few months, then you scrub it and move on to a new account that had no associations with the old account

Yeah, I know what you meant, but I'm saying even with that, a sufficiently motivated adversary could tie you to the accounts if you made any deviation from best security practices. At the very least, r/pushshift (ceddit backend) records every public comment in real-time, so unless you're scrubbing comments at the end of every month before it takes a final snapshot, those comments are tied to that username indefinitely, and publicly available. NSA and CIA are almost certainly saving every edit as well.

And as far as linking accounts, there are other statistical methods that used in conjunction could be pretty effective. If you're using a desktop browser, browser fingerprints are fairly effective at this. If you're using the official apps, they're almost certainly matching accounts to devices directly. If you're using a closed-source third-party app, you have no clue what they're doing with your data. Open source apps or third party API clients are probably the best bet, but the smaller userbase will make you stick out more. Voting patterns, patterns in usage time, linguistic patterns, etc. can also provide useful classification data.

I'm not saying you can't or shouldn't try to obscure your identity, but if NSA wants to identify you, I don't think they'll have a hard time at it. I'm just saying use the platform with that understanding.

All good points and I agree, we should consider the entire internet "public" at this point, there is no privacy and we should act as such.

Pattern analysis between account comments, even if you use separate VPN outlets for different accounts, one in Sweden, and one in Pakistan, can never be defeated even if you purposely are able to maintain a different subreddit pattern activity as a front and know how to write in a style that is different enough to try to trick pattern analysis algorithms, which are nearly guaranteed to be more sophisticated at this type of trickery than we realize.

Pattern analysis between account comments... can never be defeated

Yeah, and I think that even if you can beat the current classifiers, the technology is constantly improving such that in a year or five or ten, they'll almost certainly beat you if they want to. Even matching just three or four independent feature vectors could be enough if they're each unique enough.
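As a toy version of matching a few feature vectors across accounts, the sketch below scores how alike two accounts look by averaging cosine similarity over several behavioral features. The feature names, the numbers, and the plain averaging are all assumptions made for illustration; a real classifier would be far more involved.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def link_score(acct_a, acct_b):
    """Average cosine similarity over the features both accounts have."""
    shared = sorted(set(acct_a) & set(acct_b))
    if not shared:
        return 0.0
    return sum(cosine(acct_a[f], acct_b[f]) for f in shared) / len(shared)

# Hypothetical features per account:
#   active_hours:   comment counts per 6-hour block of the day
#   function_words: rate of a few common function words per 1000 words
#   vote_mix:       share of votes cast in three subreddit categories
old_account = {"active_hours": [0, 2, 9, 4], "function_words": [31, 12, 7], "vote_mix": [0.7, 0.2, 0.1]}
new_account = {"active_hours": [0, 3, 8, 5], "function_words": [29, 13, 8], "vote_mix": [0.6, 0.3, 0.1]}
stranger    = {"active_hours": [8, 5, 0, 1], "function_words": [10, 30, 2], "vote_mix": [0.1, 0.1, 0.8]}

same_person_score = link_score(old_account, new_account)
stranger_score = link_score(old_account, stranger)
```

With these made-up numbers the old and new accounts score close to 1 while the stranger scores much lower, even though no single feature is an identifier on its own.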

All good points and I agree, we should consider the entire internet "public" at this point, there is no privacy and we should act as such.

At least any part of the internet that is not expressly focused on privacy, and maintained and audited by multiple independent parties with a vested interest in privacy. And even then, that privacy should only be partially trusted for a finite amount of time with the expectation that your encryption and privacy scheme will be broken, and past communications will be compromised as well.

If you really want security and privacy, you almost have to go old school. USPS is probably your best bet for sending secure communications over a distance if you can be reasonably sure you're not currently under active surveillance. Even the big boys like NSA will only interfere with that if they have to.

Good points. But how are they tying user data to real identities?

On their own, reddit can tie accounts to IP addresses, devices, browser signatures, the app you use (which possibly provides device information), email if you give it to them.

In many cases, this probably isn't enough to make a positive identification, but it does create a feature vector that can be compared to vectors from other datasets. They could be partnering with other data firms, be purchasing access to other datasets, or be giving data to government agencies, either wittingly or not.

If NSA has full access to reddit internals (which they very likely might), they can easily cross-reference it with their other sources and probably be reasonably confident in identifying 95% of accounts, especially accounts that are active for a period of some months.

devices, browser signatures

Any spoofing recommendations for these?

I looked into it a while ago, and there are some best practices, but the detection methods are pretty good. Even something like having different fonts installed on your machine or different browser extensions represent a feature which can be used to narrow the pool of users with similar browsers.

Your best bet is to make your browser appear like a lot of other people's browsers. A fresh Windows 10 install with an up to date, barebones chrome is likely the best option, but not having adblock or ghostery/disconnect leaves you open to other attack vectors. Anything you do to hide yourself consequently puts you in a category of people trying to hide themselves, which narrows the pool, especially if you try to hide yourself in a different way than everyone else. I really think it's a crapshoot.

But I guess downloading webpages via curl in a *nix terminal and rendering your pages offline would work, but you won't have any javascript functionality, and most interactive sites like reddit would be mostly broken.

Thanks Howard. I reckon noscript, uMatrix and ublock are pretty commonly used at this point.

Thanks Howard.

;D Still singing!

I reckon noscript, uMatrix and ublock etc. are pretty commonly used at this point.

Yeah, probably, even in that combination there are probably hundreds of thousands if not millions of users. But then within that group, if you have noscript set to disable or enable a feature that a server can detect, that distinguishes you further within that group. There's a site somewhere that will tell you how unique your fingerprint is. You can probably tweak your set up and compare them to find something that minimizes your uniqueness profile.
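One common way to put a number on fingerprint uniqueness is bits of identifying information: a trait shared by a fraction p of users contributes -log2(p) bits. The sketch below compares two setups that way. The fractions are invented, and adding the bits together assumes the traits are independent, which real browser traits usually are not.

```python
import math

def surprisal_bits(fraction):
    """Bits of identifying information in a trait shared by `fraction` of users."""
    return -math.log2(fraction)

def fingerprint_bits(trait_fractions):
    """Total bits for a setup, assuming (unrealistically) independent traits."""
    return sum(surprisal_bits(f) for f in trait_fractions.values())

# Made-up fractions of users sharing each trait value.
stock_setup = {"browser": 0.5, "fonts": 0.25}
tweaked_setup = {"browser": 0.5, "fonts": 0.25, "extensions": 0.125}

stock_bits = fingerprint_bits(stock_setup)
tweaked_bits = fingerprint_bits(tweaked_setup)

# A setup with b bits is shared by roughly one user in 2**b.
stock_pool = 2 ** stock_bits
tweaked_pool = 2 ** tweaked_bits
```

In this toy case the extra extension trait doubles the bits (3 to 6), which shrinks the crowd you blend into from about 1 in 8 to about 1 in 64. Comparing setups this way is one approach to finding the one that minimizes your uniqueness.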

Edit: And I would not recommend chrome when data protection is the goal.

For sure. I just picked chrome because it's by far the most popular, and therefore least unique, browser out there. Maybe Edge on Win10 is actually better just because it's completely stock, so you'd be lumped in with a bunch of people who never touch their system.

There's so many tradeoffs that you have to weigh what makes the most sense for you, and which privacy concerns you want to prioritize over others (e.g. identifiability vs. data leakage vs. data theft).

I don't even bother up/down voting. Meaningless to me.

Have an upboat!

Highly likely, wouldn't shock me in the slightest. However I'm not even gonna try to filter myself. I'm already in too deep, I wasn't careful so when you google my name the first thing that shows up is me with some weed plants lool

Better spam protection on Reddit as well. Much better.

Negatives aside, would this not help to better understand humans in modern society in general? Maybe it can help our society as a whole better itself if the data is being used.
