Hey AI Explorer,
Welcome back to the latest edition of your Friday newsletter “Solve with AI”.
Note: This is a longer post with many images so if you are reading directly from your email box it may get truncated. Go to Substack to read the full post.
The AI buzz is at its peak with individuals and corporations rushing to get their hands on the latest Gen AI models and devices.
Google recently launched the Pixel 8a, an affordable version (the “a” denotes the affordable aspect) of the Pixel 8 that launched last fall.
According to Google, the Pixel 8a comes with native AI capabilities, including the Gemini Nano model, which can run without a Wi-Fi or data connection.
Google and OpenAI are competing with each other to create the fastest, most accurate, and most intelligent Gen AI models.
This newsletter is sponsored by NordVPN.
On the surface, the latest Gen AI models from OpenAI and Google, ChatGPT 4o and Gemini 1.5 Pro (a.k.a. Gemini Advanced), look similar.
Source: LinkedIn
Reality is quite different.
How?
AI models can be put to the test in various ways:
Validation and test data sets: Use a data set, such as open-source city or country population data, to test the model's accuracy and performance.
Evaluation tools: Use tools such as a confusion matrix, ROC curve, AUC score, or error analysis to visualize and interpret the model's results.
Integration testing: Check the compatibility and interoperability of AI/ML models with other systems and components.
Crowd testing: Use a large, diverse group of people to test under real-world conditions. This is a popular method among model developers such as Google and OpenAI, which anonymously record your actions to compare against those of other users. It can affect user privacy in many ways.
Data validation: Test how the model behaves and makes predictions based on new information. For example, can a Gen AI model update its weather prediction when the pattern changes, the way the robust machine learning weather models developed by IBM do?
Automated testing: Use specialized software tools and scripts to execute test cases efficiently.
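To give a feel for what one of these tools does, here is a minimal sketch of a confusion matrix for a yes/no classifier in plain Python. The labels are made up purely for illustration, and the helper names (confusion_matrix, accuracy) are my own, not from any particular library.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],  # predicted 1, actually 1
        "fp": counts[(0, 1)],  # predicted 1, actually 0
        "fn": counts[(1, 0)],  # predicted 0, actually 1
        "tn": counts[(0, 0)],  # predicted 0, actually 0
    }


def accuracy(matrix):
    """Share of predictions that were correct."""
    total = sum(matrix.values())
    return (matrix["tp"] + matrix["tn"]) / total


# Hypothetical labels, invented for this example only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
print(matrix)
print(accuracy(matrix))
```

With real models you would typically reach for a data science library instead of hand-rolling this, but the idea is the same: compare what the model said against what was true.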
Unless you have a data science background, most of these types of tests will not be easy to do. You will have to learn how to test and interpret the results.
What’s the fun in that?
Don’t worry, there are ways you can test AI models using the Chat User Interface (CUI).
It is more fun to test these models through the UI because you see the responses directly. To take it up a notch, I will test ChatGPT 4o and Gemini Advanced side by side to see how they compare.
To make the test rigorous, I want to use the following four areas of testing:
Common sense reasoning: This type of test evaluates the AI model's ability to apply everyday knowledge to solve problems.
These questions often require understanding beyond strict factual knowledge, requiring the model to interpret and reason about real-world scenarios. Calculating travel time, mixing colors, and comparing weights are a few examples.
Logical thinking and puzzles: This tests the AI model's problem-solving ability through deductive reasoning, pattern recognition, and logical consistency.
These questions often involve understanding sequences and spatial relationships, and interpreting tricky or paradoxical scenarios. Multiplication tricks, counting objects, and puzzles related to age are a few examples.
Simple to advanced mathematics: Simple mathematical and logical problems test the AI model's proficiency in basic arithmetic, mathematical reasoning, and applying logical rules to derive correct answers.
These questions typically involve straightforward calculations or logical deductions based on given rules.
Visual Test: Testing the AI model’s ability to understand graphics and images. These questions often require understanding visual cues, spatial orientation, and perceptual consistency.
Let’s put the models to the test now.
I have subscribed to the paid versions of both ChatGPT and Google Gemini to access the advanced features.
You can perform similar tests using the free versions, ChatGPT 3.5 and basic Gemini, if you prefer. Your responses may differ from mine.
Rules: For each test, I will type the specific questions into the Chat UI and take a screenshot of the response. Then, we will decide which model is more accurate and give it a score. 1 point for correct and 0 for incorrect answers.
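If you want to keep score for your own tests, the 1-or-0 rule is easy to track in a few lines of Python. This is only a sketch: the test names and scores below are placeholders I made up, not results from this post, and tally is a hypothetical helper.

```python
def tally(results):
    """Add up the 1/0 points each model earned across all tests."""
    totals = {}
    for scores in results.values():
        for model, point in scores.items():
            totals[model] = totals.get(model, 0) + point
    return totals


# Placeholder results for illustration only: 1 = correct, 0 = incorrect.
example_results = {
    "Example test 1": {"Model A": 1, "Model B": 1},
    "Example test 2": {"Model A": 0, "Model B": 0},
    "Example test 3": {"Model A": 1, "Model B": 0},
}

totals = tally(example_results)
print(totals)
```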
Common sense reasoning
Calculate Travel Time
If it takes 2 hours to travel 60 miles, how long will it take to travel 90 miles at the same speed?
ChatGPT 4o
Gemini Advanced
Both answers are correct so both will get 1 point for this test.
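For anyone who wants to double-check the arithmetic, here is a quick sketch. The function name travel_time is my own; the numbers come straight from the question.

```python
def travel_time(known_distance, known_time, new_distance):
    """Time needed for new_distance at the speed implied by the known trip."""
    speed = known_distance / known_time  # 60 miles / 2 hours = 30 mph
    return new_distance / speed


hours = travel_time(known_distance=60, known_time=2, new_distance=90)
print(hours)  # 3.0
```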
Magic Room Test
In a magical building, if you enter a room and then come out and enter the room on the opposite side, what will you find there if the room you started in was Room 3?
ChatGPT 4o
I am impressed. ChatGPT 4o makes an attempt to identify different scenarios instead of trying to solve the problem. 1 point.
Gemini Advanced
Gemini responses are also intelligent but not similar to ChatGPT. I see more creativity in Gemini’s response. I will give it a 1.
Disappearing Act
If you have a basket that can hold 5 oranges, and you put 3 apples and 2 oranges in it, where are the apples?
ChatGPT 4o
Gemini Advanced
So far, both models are competing head to head and we have a tie with 1 point each.
Let’s make the questions a bit more confusing.
Weight Comparison
What weighs more: a pound of feathers or a pound of steel, considering that feathers are typically measured in avoirdupois pounds and steel in troy pounds?
Instead of the usual trick of two objects with the same weight, I introduced different units of measure.
ChatGPT 4o
Gemini Advanced
Woah!
1 point for each again.
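Here is why the units matter, as a small sketch. It relies on the standard definitions of the two pounds in grams (an avoirdupois pound is 453.59237 g and a troy pound is 373.2417216 g); the variable names are just my own.

```python
# Standard definitions of each pound, expressed in grams.
AVOIRDUPOIS_POUND_G = 453.59237   # the everyday pound
TROY_POUND_G = 373.2417216        # 12 troy ounces of 31.1034768 g each

feathers_g = 1 * AVOIRDUPOIS_POUND_G  # one avoirdupois pound of feathers
steel_g = 1 * TROY_POUND_G            # one troy pound of steel

heavier = "feathers" if feathers_g > steel_g else "steel"
difference_g = abs(feathers_g - steel_g)

print(heavier)
print(round(difference_g, 2), "grams heavier")
```

Under those definitions, the pound of feathers comes out heavier by roughly 80 grams.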
This is getting interesting.
Instruction Following
Write 10 sentences where each sentence starts with the word "banana" and ends with the word "tree," ensuring that each sentence has exactly 12 words.
ChatGPT 4o
Failed! ChatGPT 4o started well with a 12-word first sentence, but most of the remaining sentences have only 10 or 11 words. Only the last one gets back to 12.
Zero points.
Gemini Advanced
Gemini did even worse. The very first sentence only has 8 words. The last sentence has 13 words.
Zero points.
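Counting words by hand across 10 sentences gets tedious, so here is a rough sketch of a checker you could paste a model's sentences into. check_sentence is a hypothetical helper, and it assumes words are separated by spaces and that punctuation stuck to the first or last word should be ignored.

```python
import string


def check_sentence(sentence, first="banana", last="tree", word_count=12):
    """Return True if the sentence starts with `first`, ends with `last`,
    and has exactly `word_count` space-separated words."""
    words = sentence.split()
    if not words:
        return False
    # Ignore case and surrounding punctuation such as a trailing period.
    cleaned = [w.strip(string.punctuation).lower() for w in words]
    return (
        len(words) == word_count
        and cleaned[0] == first
        and cleaned[-1] == last
    )


# Two sample sentences I wrote myself, not model output.
good = "Banana bread cooled on the porch beside the old and crooked tree."
bad = "Banana peels fell from the tree."

print(check_sentence(good))  # True: 12 words, right start and end
print(check_sentence(bad))   # False: only 6 words
```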
Object Location
There is a box with no top, bottom, or sides. If you place a ball in the box, where is the ball considering the box's paradoxical nature?
Logical Thinking and Puzzles
Efficiency Puzzle
If it takes 4 men 8 hours to build a wall, how long would it take 2 men to build half of that wall if the efficiency of work decreases by 20% due to fewer workers?
ChatGPT 4o
Great! ChatGPT gets 1 point.
Gemini Advanced
Wow! 1 point for Gemini as well.
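If you want to work it out yourself, here is one way to set it up. This sketch assumes the 20% drop applies to each worker's output rate (so 2 men work like 1.6 men); the question can be read in other ways, and the variable names are my own.

```python
from fractions import Fraction

# Total effort for the full wall, in man-hours: 4 men x 8 hours.
full_wall_man_hours = Fraction(4 * 8)
half_wall_man_hours = full_wall_man_hours / 2  # 16 man-hours

workers = 2
efficiency = Fraction(80, 100)  # assumption: each man works at 80% of the original rate

# Effective workforce is 2 x 0.8 = 1.6 men, so time = 16 / 1.6.
hours_needed = half_wall_man_hours / (workers * efficiency)
print(hours_needed)  # 10
```

Without the efficiency penalty the answer would be 8 hours, so the penalty adds 2 hours under this reading.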
Direction Puzzle
You are facing north. If you turn 90 degrees to your right, walk 100 steps, turn 90 degrees to your left, walk 50 steps, turn 180 degrees, and walk 50 steps, which direction are you facing?
ChatGPT 4o
Gemini Advanced
Both get 1 point. I had assumed that Gemini would perform worse than ChatGPT. It looks like Google has made strong progress since the model backlash a few months ago.
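You can trace the turns with a tiny simulation. This is just a sketch with names I picked (HEADINGS, final_heading): right turns are positive degrees, left turns are negative, and walking is left out because it never changes which way you face.

```python
HEADINGS = ["north", "east", "south", "west"]  # clockwise order


def final_heading(start, turns):
    """Apply a list of turns in degrees (right = positive, left = negative)."""
    index = HEADINGS.index(start)
    for degrees in turns:
        index = (index + degrees // 90) % len(HEADINGS)
    return HEADINGS[index]


# Right 90, left 90, then turn around 180.
facing = final_heading("north", [90, -90, 180])
print(facing)  # south
```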
Counting Objects
There are 5 apples in a basket. You take away 3 apples, but later you find 2 more apples in your hand that you didn't take initially. How many apples are now in the basket?
I am willing to bet one of the models will get this question wrong.
ChatGPT 4o
A great answer earns ChatGPT 1 point.
Gemini Advanced
Oh no! Gemini dropped the ball.
Where did the 4 apples come from?
Zero points.
ChatGPT is now in the lead with 6 points to Gemini's 5.
3 more to go.
Let’s switch to mathematics.
Mathematics
Pattern Recognition
Determine the next term in the sequence: 3,6,18,72, if each term is obtained by multiplying the previous term by an increasing integer sequence starting from 1.
ChatGPT 4o
Gemini Advanced
That’s fantastic. Both models are extremely capable of solving simple mathematics problems.
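Here is a small sketch of how the pattern can be checked. Looking at the numbers themselves, the multipliers between terms are 2, 3, and 4, so the next multiplier would be 5, whatever starting point the prompt's wording suggests. next_term is a hypothetical helper that only handles multipliers that grow by 1 each step.

```python
def next_term(sequence):
    """Extend a sequence whose term-to-term multipliers grow by 1 each step."""
    pairs = list(zip(sequence, sequence[1:]))
    if any(b % a != 0 for a, b in pairs):
        raise ValueError("terms are not whole-number multiples of each other")
    ratios = [b // a for a, b in pairs]  # 6/3, 18/6, 72/18 -> 2, 3, 4
    steps = {later - earlier for earlier, later in zip(ratios, ratios[1:])}
    if steps != {1}:
        raise ValueError("multipliers do not increase by 1 each step")
    return sequence[-1] * (ratios[-1] + 1)  # 72 * 5


answer = next_term([3, 6, 18, 72])
print(answer)  # 360
```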
Let’s go to our last test: the ability to produce an accurate image based on a description.
Visual and Perceptual Questions
Visual Test
Show an image of a half-full glass with a reflection in a mirror and ask, is the glass half full or half empty, considering the perspective from the mirror's reflection?
ChatGPT 4o
Fail! The image does not show a half-full glass. No points for ChatGPT.
Gemini Advanced
Fail! Gemini is not showing any mirror at all.
Conclusion:
First, our winner: ChatGPT 4o, by 1 point.
Based on the above tests, here are my conclusions:
Both Google and OpenAI have made significant progress in advancing their respective models.
ChatGPT 4o tends to break down complex problems into detailed steps to produce the result.
Gemini Advanced has a creative edge but produces broadly similar results.
Both models are robust in solving logical and simple mathematical problems. Both models display human-like reasoning skills.
The image production capabilities of both models are good but lack attention to detail. Neither model produced an image that accurately matched the exact ask.
There you have it. I would love to see in the comment section below what trick questions you come up with to test these models.
Talk Soon,
Creator of Solve with AI.