One of my most popular articles is the one on target encoding. It gets a fair amount of mentions in Kaggle discussions and I see it pop up from time to time in other contexts. It also brings me around 2500 unique monthly viewers. That’s quite a chunk of people for an unambitious blogger like me. Up to a few months ago, my article was on the first page of Google when you typed in searches such as “target encoding python” and “bayesian target encoding”. This was purely organic and it felt nice to have a relevant article, even though that’s not the main reason why I blog.

It turns out that a shameless person named Venkatasai Katuru copy/pasted my article word for word and posted it on Medium. He even took the same title. Medium doesn’t support LaTeX equations, so he took screenshots of the relevant equations from my article. He really went the whole nine yards and didn’t even bother to ask me or to cite me.

The offending article was posted on October 7th, 2019. However, I was only made aware of this on February 5th, 2020 thanks to an email sent to me by Piotr Tempczyk – whom I thank very much.

I wasn’t aware of the plagiarism earlier on simply because I never check for it. In any case I suppose doing so would be very time consuming, and I’m not even sure how I would go about doing it. In fact, it would be such a pain in the neck that I would expect Google to handle this for me in its search engine. It’s probably a resource-intensive task to perform, but then again Google is not a small indie company. They already do perceptual hashing to filter out duplicates in image searches, so they certainly have the ability to detect plagiarized pages and downrank them.

Venkatasai Katuru submitted his article to the Medium account of Analytics Vidhya and got approved. This gave his article enough exposure to be ranked first in the Google keyword searches I mentioned above. I therefore contacted the Analytics Vidhya team and kindly asked them to remove the offending article. They responded in a timely manner and did so. But alas they were not able to delete the article from Venkatasai Katuru’s account. I guess that’s just the way Medium works and I’m fine with it, even though it’s debatable. The issue is that Venkatasai Katuru’s article is now the one that’s getting ranked first on Google. What’s ironic is that you can see my article mentioned in the next result, which is a Kaggle discussion.

Meanwhile, my article is buried deep down in Google’s results, even though I am the original author. I resorted to contacting Venkatasai Katuru directly via LinkedIn. To my pleasant surprise, he responded rather quickly. He explained to me that he needed to have the article up in order to get enrolled in some educational program back in his country. I told him that that was fine and all, but that I couldn’t condone his behavior and that he needed to take the article down. Since then he’s been stringing me along by coming up with excuses that I can’t verify. On top of that the COVID-19 crisis hasn’t helped speed things up. I have therefore contacted Medium and filed a formal complaint. I could have done this ages ago but I wasn’t in a rush. There are worse things happening to people and I have other fish to fry. I guess that plagiarism is the flipside of popular articles and it probably happens all the time. But eventually I assume and hope that justice will be served.

My biggest surprise with this event is just the fact that it happened in the first place. As I mentioned, Google clearly has the technology to detect that two web pages contain the same content. After all, the core of their business is to index documents. In this case, the plagiarism is blatant because Venkatasai Katuru copied my article word for word. Both articles are clearly timestamped, so it’s easy to identify the original poster. If you go to the comments section of the Medium article, all the comments call out Venkatasai Katuru and identify me as the original author.

It is thus clear to anyone reading this that plagiarism occurred. And yet, Google’s oh so mighty search engine doesn’t seem to acknowledge this. I suppose that the fact that the article was posted on Medium gives credence to Venkatasai Katuru’s article. In other words, in Google’s ranking algorithm, the domain name seems to matter more than the dates on which the two articles were published. I think that’s really sad, and is just an example of how the Internet is getting more and more narrow. In this case, Google is essentially steering readers towards a single pay-walled website and is downplaying my article because it’s on a personal blog. Although this is one single case, I think it sheds some light on the state of things in the search engine world. Some websites have become so good at playing the SEO game that it’s interfering with the quality and the veracity of the content we are presented with. Google is apparently using fancy deep learning algorithms to improve its search results, but in this case it is not even able to get the basics right. I’m not the only one to hold this view.

There are many more important things in life, especially during the current pandemic. Hopefully, everything will sort itself out in the end.